Data Privacy Protection Techniques To Safeguard Patient Data


Note: This is the fourth blog in our ongoing healthcare series on “Data Privacy in Healthcare and The Role of Technology.” This blog series deep-dives into data privacy and transparency in the healthcare industry. It explores in detail the compliance and disclosure requirements in the global pharmaceutical industry and the international laws and regulations that guide them.

This blog series also discusses the traditional manual methods of anonymization currently prevalent, how industry 4.0 solutions can automate and vastly improve conventional anonymization, and how Gramener’s AInonymize solution can transform clinical trial disclosures and regulatory compliance in the healthcare industry.

Recap: In the last article, we explored the data privacy laws in different countries and how EMA 0070 and Health Canada PRCI balance privacy with transparency.

Precap: This article will discuss the different types of data privacy protection techniques in detail, how AI is transforming the healthcare industry, and the privacy and security concerns surrounding the use of ML and data analytics.

Check out other parts of the series:

  1. An Introduction to Protecting Patient Data (Part 1)
  2. Data Transparency and Disclosure Requirements in Healthcare (Part 2)
  3. Introduction to Handling Data Privacy Laws and Requirements in Healthcare (Part 3)

Need for Ensuring Data Privacy During Submissions in View of Regulatory Requirements

More than 434,000 clinical studies had been registered globally as of 24 November 2022, and this number is growing year over year. Clinical trials are essential for the discovery of new medicines, for public health, and for other forms of healthcare innovation.

Clinical trials comprise several phases and pose innumerable challenges to researchers. One such challenge involves compliance with multiple privacy policies across numerous countries. In addition to conducting the trials, sponsors and researchers are also responsible for other facets of the operations.

This includes the anonymity of the participants and following the laws and regulations that govern them. Failure to abide may result in steep penalties, hefty fines, and even lawsuits. To avoid this, healthcare operators must possess substantial knowledge about data privacy regulations.

Clinical trials use data that falls under Protected Health Information (PHI), making its security a priority. According to the HIPAA Privacy Rule, PHI covers any information that can facilitate the identification of an individual.

This includes photos, social security numbers, health conditions, medical records, emails, addresses, names, etc.

What techniques do healthcare companies use to protect the PHI of their patients and clinical trial participants? Can the use of analytics and machine learning technology improve these methods?

Read on to find out!

Types of Techniques

Every stakeholder is obligated and responsible for ensuring that the personal data used in clinical trials is adequately protected. If this sensitive data is breached in any way, it can hurt the sponsors by way of heavy penalties and even lawsuits.

It can also cause serious harm to the clinical trial participants whose intimate details will suddenly be made public.

The following are four techniques used by healthcare organizations to protect the confidentiality of patient information and clinical trial participants.


Data Redaction

The process of removing sensitive information from documents such as medical records is known as redaction. Healthcare companies redact internal and external information to ensure no one’s security or privacy is compromised.

Once redacted, even documents that carry sensitive information can be published easily.

Each year, almost 15 million Americans fall prey to identity theft. Data redaction can help protect individuals’ identities and personal information.

Data redaction and data anonymization are not the same: anonymization hides or transforms the information, whereas redaction deletes it entirely.

How Does Data Redaction Work?

Data redaction is a simple process and consists of only 3 steps:

  • Scanning documents to identify the Personally Identifiable Information (PII) that has to be redacted
  • Removing all PII
  • Storing the redacted files so that they can be used later
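As a rough sketch, the three steps above might look like the following in Python. The regex patterns and the `[REDACTED]` marker are illustrative assumptions, not a production-grade PII scanner:

```python
import re

# Hypothetical patterns for two common PII types; a real system would
# cover many more identifiers (names, dates, addresses, ...).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.\w+\b"),
}

def redact(text: str, marker: str = "[REDACTED]") -> str:
    """Scan the text for each PII pattern and replace matches with a marker."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub(marker, text)
    return text

print(redact("Patient SSN 123-45-6789, contact jane.doe@example.com."))
# → Patient SSN [REDACTED], contact [REDACTED].
```

The redacted string can then be written out and stored for later use, completing the third step.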

Data Redaction Best Practices

Following are some of the best practices when redacting data, especially from medical records:

  • Before redaction, save a copy of the original document. Otherwise, a word, phrase, or paragraph redacted by mistake could be lost permanently.
  • Redact files in PDF format to ensure the information does not fall into unscrupulous hands; when redaction is done properly, converting the PDF into other formats like Word will not recover the removed content.
  • For manual redaction, print the photocopied or scanned document, hide the information using a black marker, then scan the redacted paper and convert it into PDF format.
  • When redacting medical records, remove all metadata attributes. Otherwise, nefarious individuals may use the document’s metadata to view sensitive information.

Importance of Data Redaction in Healthcare

Strict regulations governing patient data make medical record redaction essential in the pharma and life sciences industries. Many laws make it mandatory to redact documents from clinical trials and drug discovery processes. For example, HIPAA protects privacy by prohibiting certain data types from being shared with third parties or the public without redaction.

It also mandates that organizations protect privacy by redacting data before it enters the public domain.

Medical record redaction protects the privacy rights of patients. Any company that exposes the private information of individuals without their consent risks being sued under applicable laws, resulting in lawsuits that cost millions and irreparable loss to reputation.

What Kinds of Medical Data Should Be Redacted?

The following kinds of medical information should ideally be redacted:

Information That May Cause Harm

Healthcare providers and organizations redact medical records to protect patients and their families. Regulations such as EMA 0070 and GDPR require organizations to balance public disclosure and transparency with data privacy, necessitating the redaction of medical records.

This includes any information that can help identify an individual, such as birthdates, social security numbers, phone numbers, addresses, or names.

3rd Party Information

Not every stakeholder with access to medical documents may have the right to view third-party information. This could lead to cases of fraud or identity theft.

An individual’s confidential information may be used against them by dubious individuals. Family and friends of patients may have to share private information with doctors in the patient’s absence. The rights of all such parties must be protected.

As such, protecting the interests of all parties involved is essential.

Complying With Regulations

International regulations like GDPR, EMA 0070, and HIPAA require healthcare providers to protect the PHI of their patients. These regulations also disallow institutions under their purview from publishing patient records without consent.


Data Anonymization

Data anonymization is a form of information sanitization that involves the encryption or removal of personally identifiable data in a dataset. The goal of data anonymization is to protect the privacy of its subjects. It also minimizes the risks of information breaches or leaks during data transfer or sharing.

Since it does not alter the data format, the information can still be analyzed and used post-anonymization.

The EU’s General Data Protection Regulation (GDPR) mandates the anonymization of stored information of EU residents. Anonymized data sets are not personal data and are hence not covered by GDPR. This allows healthcare operators to distribute anonymized documents more freely without breaking the law or violating the rights of individuals.

Data anonymization is also an integral component of HIPAA in the US. HIPAA governs how Protected Health Information (PHI) is used across the healthcare industry and its partners.

What Are the Different Kinds of Data Anonymization Techniques?

The following are some of the techniques used to anonymize sensitive data.

Data Masking

Data masking involves modifying sensitive data. There are two types of data masking – static data masking and dynamic data masking. Static data masking uses anonymized data to create a mirror version of the database.

Dynamic data masking modifies data in real-time while it is being accessed. This type of anonymization can be carried out using dictionary substitution, character or term shuffling, or encryption.
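A minimal sketch of the two masking styles, assuming a simple string-valued field; real masking engines operate at the database layer:

```python
import random

def mask_static(value: str) -> str:
    """Static masking: overwrite the stored value with a fixed-format surrogate."""
    return "X" * len(value)

def mask_dynamic(value: str, authorized: bool) -> str:
    """Dynamic masking: reveal the real value only to authorized queries."""
    return value if authorized else mask_static(value)

def shuffle_chars(value: str, seed: int = 0) -> str:
    """Character shuffling, one of the substitutions dynamic masking can use."""
    chars = list(value)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

print(mask_dynamic("123-45-6789", authorized=False))  # → XXXXXXXXXXX
```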


Generalization

Generalization omits certain data to make the information less identifiable. The data may also be changed using a value range with logical boundaries. For example, the house number of a specific address may either be deleted or replaced by an arbitrary value, as long as the latter is within 200 house numbers of the original value.

Generalization removes particular identifiers without compromising the accuracy of the data.
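The range-based idea above can be sketched as follows; the bucket widths are illustrative:

```python
def generalize_age(age: int, bucket: int = 10) -> str:
    """Replace an exact age with a range with logical boundaries."""
    low = (age // bucket) * bucket
    return f"{low}-{low + bucket - 1}"

def generalize_house_number(number: int, width: int = 200) -> str:
    """Replace a house number with a range no wider than `width` numbers."""
    low = (number // width) * width
    return f"{low}-{low + width - 1}"

print(generalize_age(34))            # → 30-39
print(generalize_house_number(123))  # → 0-199
```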

Data Swapping

Data swapping, also known as data permutation or shuffling, redistributes the dataset’s attribute values so that they no longer match the original records. Shuffling attributes (columns) that contain recognizable values, such as date of birth, is a key anonymization step.
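A minimal sketch of column shuffling, assuming records are plain dictionaries and using a seeded shuffle for reproducibility:

```python
import random

def swap_column(records: list[dict], key: str, seed: int = 0) -> list[dict]:
    """Shuffle one attribute's values across records so rows no longer match."""
    values = [r[key] for r in records]
    random.Random(seed).shuffle(values)
    # Rebuild each record with the shuffled value for the chosen attribute.
    return [dict(r, **{key: v}) for r, v in zip(records, values)]

rows = [{"id": i, "dob": d} for i, d in enumerate(["1990", "1985", "2001", "1978"])]
swapped = swap_column(rows, "dob", seed=1)
```

The set of birth years is preserved, but each year is detached from its original record.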

Data Perturbation

Data perturbation tweaks the initial dataset by adding random noise and rounding values. The noise applied must be proportional to the values, and the choice of rounding base is crucial to modifying the original values.

A small base may not be able to anonymize the data sufficiently. On the other hand, a large base may render the data unusable or unrecognizable.
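A small sketch combining both ideas, noise followed by rounding to a base; the noise range and base of 5 are illustrative:

```python
import random

def perturb(values, noise=2.0, base=5, seed=0):
    """Add uniform noise, then round each result to the nearest multiple of `base`."""
    rng = random.Random(seed)
    return [base * round((v + rng.uniform(-noise, noise)) / base) for v in values]

print(perturb([42, 67, 80]))  # values near, but not equal to, the originals
```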

Synthetic Data

Synthetic data is not connected to any real case and is algorithmically produced. Instead of risking privacy and protection by modifying or utilizing the original dataset, synthetic data is used to create artificial datasets.

This approach uses mathematical systems based on features or patterns in the original dataset. Synthetic outcomes are created by using statistical methods like medians, standard deviations, linear regressions, etc.
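As a toy illustration, synthetic values can be drawn from a normal model fitted to an original column; real synthetic-data generators model joint distributions across columns, not a single one:

```python
import random
import statistics

def synthesize(column, n, seed=0):
    """Draw n synthetic values from a normal model fitted to the original column."""
    mu = statistics.mean(column)
    sigma = statistics.stdev(column)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

ages = [34, 45, 29, 61, 50]
fake_ages = synthesize(ages, 1000, seed=1)  # statistically similar, no real cases
```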

What are the Different Kinds of Identifiers That Can be Anonymized?

Identifiers that are unique & can directly identify a single individual are known as direct identifiers. This includes a social security number, email address, or phone number.

Individually, quasi-identifiers cannot identify a single individual. However, combined with other quasi-identifiers, they can single out a person. For example, job titles such as CEO or CFO are quasi-identifiers since they apply to many individuals.

However, when dealing with a dataset of only a single organization, the individuals associated with these titles may be readily identified.
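A tiny sketch of why combining quasi-identifiers matters; the staff records are invented for illustration:

```python
def matching_records(records: list[dict], **quasi) -> int:
    """Count records matching a given combination of quasi-identifier values."""
    return sum(all(r.get(k) == v for k, v in quasi.items()) for r in records)

staff = [
    {"title": "CEO", "org": "Acme Pharma"},
    {"title": "CEO", "org": "Beta Biotech"},
    {"title": "CFO", "org": "Acme Pharma"},
]

print(matching_records(staff, title="CEO"))                     # → 2
print(matching_records(staff, title="CEO", org="Acme Pharma"))  # → 1, a single person
```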

The following direct & quasi-identifiers can be anonymized:

  • Zip codes
  • Unique characteristics of the patient
  • Admission dates
  • Faces
  • Date of birth
  • Medical record numbers
  • Social security numbers
  • IP addresses
  • Cellphone, fax, or telephone numbers
  • Email addresses
  • Names

What Kind of Formats and Data Types Can be Anonymized?

The following data formats of medical images, documents, and videos can be anonymized:

  • Videos – mpg, mpeg, webm, avi, mov, mp4
  • Images – webp, bmp, png, jpg, jpeg
  • Documents – pdf
  • Medical images – dcm (DICOM)

How Can Anonymized Data Improve Healthcare?

Large amounts of medical data are generated each day globally. Rapid transfer and analysis of health data can accelerate cost reduction, disease prevention, improved quality of care, and faster medical decisions. It can also drive innovative healthcare solutions.

Removing or anonymizing personal information that can identify an individual is the first step toward addressing privacy concerns and complying with regulations, and it results in superior outcomes.


Pseudonymization

Pseudonymization is a process made popular by its adoption in the GDPR, which describes pseudonymization as a mechanism for security and data protection by design. Pseudonymizing electronic healthcare records helps preserve the confidentiality and privacy of patient data.

Pseudonymization makes it easier to comply with HIPAA, which lays down the guidelines for handling healthcare data in the US. Under GDPR, if pseudonymization is properly carried out, it can help relax, up to a certain degree, the legal obligations of the data controllers.

While both HIPAA and GDPR allow and recognize pseudonymization, they differ in the legal definition of the process. GDPR classifies pseudonymous data as personal data, whereas HIPAA allows data to be shared if the stipulated data fields are pseudonymized.


How Is Pseudonymization Different From Anonymization?

According to the GDPR, personal data can be considered pseudonymized when it can no longer identify the data subject without additional information. Unlike anonymization, however, pseudonymization is reversible: the data can be re-linked to the individual using that additional information.

To be GDPR compliant, the additional information that can connect the pseudonymized data to an identifiable natural person has to be stored separately and be subject to organizational and technical measures.

ISO/TS 25237:2017 defines anonymization as any process where personal data is altered irreversibly. There is no longer a way to identify the subject of the data, either with the help of the data controller or by collaborating with a third party.

Thus, pseudonymization and anonymization are not the same.

What are the Benefits of Pseudonymization?

One obvious benefit of pseudonymization is hiding the identity of data subjects from a third party, especially in the context of patients and clinical trial participants.

Digitalized patient care and healthcare services facilitate studies and research that combine complex and large data sets from myriad sources. Pseudonymization helps mitigate the privacy risks this data sharing poses to individuals.

Pseudonymized data can therefore be put to secondary uses like drug customization, scientific research, policy assessment, comparative analysis, and other health-related initiatives.

Under GDPR, adequately pseudonymized data is

  • A failsafe to ensure that new data processing is compatible (Article 6(4))
  • An organizational and technical measure to help enforce data minimization principles (Article 25)
  • A security measure that reduces liability and notification obligations by making data breaches less likely to risk the rights and freedoms of natural persons (Articles 32–34)

Techniques of Pseudonymization

An EU Agency for Cybersecurity (ENISA) report has explored the technical solutions that can help with the implementation of pseudonymization.

In theory, pseudonymization maps identifiers like email addresses, IP addresses, or names to pseudonyms. For the process to work, each identifier must map to a distinct pseudonym; otherwise, it will not be possible to reidentify individuals from their pseudonyms. If the process permits, multiple pseudonyms may be assigned to a single identifier, as long as there are clearly established rules for reidentification.

Pseudonyms are linked to their original identifiers by a piece of information known as the pseudonymization secret. Since it is indispensable to the efficacy of the pseudonymization process, this secret must be protected using adequate organizational and technical measures.

To ensure that the identifiers do not fall into the wrong hands, the pseudonymization secret must be stored separately from the dataset, and only authorized personnel, governed by strong access control policies, should have access to it.

Furthermore, if the pseudonymization secret is digitally stored, it must be encrypted. This requires proper storage and key management.

The following are some widely used pseudonymization techniques.


Counter

This is the simplest pseudonymization technique: a monotonic counter produces a number that substitutes for each identifier. To prevent ambiguity, the counter never repeats a value.

Simplicity is the biggest advantage of this technique. However, in sophisticated and large datasets where the complete pseudonymization mapping table needs to be stored, this solution may present scalability and implementation issues.
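A minimal counter-based pseudonymizer might look like this; the P-prefixed format is an illustrative choice, and the in-memory mapping table stands in for the protected store a real system would use:

```python
import itertools

class CounterPseudonymizer:
    """Substitute each identifier with the next value of a monotonic counter."""

    def __init__(self):
        self._next = itertools.count(1)
        self._mapping = {}  # the mapping table that must be stored and protected

    def pseudonym(self, identifier: str) -> str:
        if identifier not in self._mapping:
            self._mapping[identifier] = f"P{next(self._next):06d}"
        return self._mapping[identifier]

p = CounterPseudonymizer()
print(p.pseudonym("alice@example.com"))  # → P000001
print(p.pseudonym("bob@example.com"))    # → P000002
print(p.pseudonym("alice@example.com"))  # → P000001 (stable mapping)
```

The `_mapping` dictionary is exactly the table that causes scalability issues on very large datasets.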

Random Number Generator (RNG)

The RNG mechanism produces values that have an equal chance of being selected from the total set of possibilities and assigns these unpredictable values to identifiers. The mapping can be created in either of the following ways:

  • A cryptographic pseudo-random generator
  • A true random number generator

Since it is difficult to extract initial identifier information without compromising the mapping table, RNG offers strong data protection. Again, subject to the scenario, scalability may pose a challenge if the complete pseudonymization mapping table has to be stored.
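A sketch using Python's `secrets` module as the cryptographic random source; the 8-byte token length is an illustrative choice:

```python
import secrets

def rng_pseudonymize(identifiers) -> dict:
    """Map each identifier to a fresh, unpredictable random token."""
    mapping = {}
    used = set()
    for ident in identifiers:
        token = secrets.token_hex(8)
        while token in used:  # regenerate in the (unlikely) event of a collision
            token = secrets.token_hex(8)
        used.add(token)
        mapping[ident] = token
    return mapping

table = rng_pseudonymize(["alice@example.com", "bob@example.com"])
```

As with the counter technique, the resulting mapping table must itself be stored and protected.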

Cryptographic Hash Function

A cryptographic hash function maps input strings of arbitrary length to fixed-length outputs. The hashing function is applied directly to the identifier, and the resulting digest serves as the corresponding pseudonym.

A hashing function is simple to apply and requires no mapping table. However, since it is prone to dictionary and brute-force attacks, it is often considered a weak pseudonymization technique.
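A sketch of keyless hashing; note that because anyone can recompute the digest for a guessed identifier, this determinism is exactly what enables dictionary attacks:

```python
import hashlib

def hash_pseudonym(identifier: str) -> str:
    """Keyless SHA-256 pseudonym: deterministic, so guessable inputs are recoverable."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()

print(hash_pseudonym("alice@example.com"))  # fixed-length 64-hex-char digest
```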

Message Authentication Code (MAC)

MAC is described as a keyed-hash function since it requires a secret key to generate the pseudonym. The pseudonyms and the identifiers cannot be mapped without the knowledge of this key. The most popular design of MAC used in internet protocols is HMAC.

MAC is considered a sound data protection technique since, unless the key has been compromised, reversing the pseudonym is not possible. Specific constructions may vary depending on scalability and utility requirements.
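A sketch using HMAC-SHA256 from the standard library; without the secret key, the pseudonym cannot be recomputed or reversed:

```python
import hashlib
import hmac

def mac_pseudonym(identifier: str, key: bytes) -> str:
    """Keyed-hash (HMAC-SHA256) pseudonym; reversal requires the secret key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

key = b"example-secret-key"  # in practice, generated and stored securely
print(mac_pseudonym("alice@example.com", key))
```

The same identifier under a different key yields an unrelated pseudonym, so compromising one dataset's pseudonyms does not compromise another's.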


Encryption

Contrary to popular belief, encryption is not an anonymization technique. Because an encryption key (a “secret”) can map the ciphertext back to the identifier, the ciphertext remains a pseudonym and, therefore, personal data.

The block size of the cipher to be used determines the length of the identifier that will be pseudonymized using encryption.

Advanced cryptographic techniques like Fully Homomorphic Encryption (FHE) may effectively anonymize encrypted data, since they allow operations on encrypted data without decryption. However, FHE’s high computational overhead currently makes it inefficient and impractical for processing personal data.


Data De-Identification

Data de-identification breaks the link between individuals and the data associated with them by transforming or removing personal identifiers. De-identification makes it easier to reuse the data and share it with third parties.

Data de-identification is primarily governed by HIPAA and is mostly associated with medical data. Other frameworks like GDPR, CPRA, and CCPA also govern it. HIPAA allows data to be de-identified in any one of the following two ways.

Safe Harbor

To ensure that no residual information may be used for identification, this method requires the removal of the following 18 different types of identifiers:

  • Any unique identifying code, characteristic, or number
  • Biometric identifiers
  • Full-face photos and comparable images
  • Internet protocol addresses
  • Serial numbers and device identifiers
  • Web URLs
  • Vehicle identifiers and serial numbers, such as license plates
  • Certificate/license numbers
  • Health plan beneficiary numbers
  • Account numbers
  • Medical record numbers
  • Email addresses
  • Social security numbers
  • Fax numbers
  • Geographical data
  • Telephone numbers
  • Dates, except years
  • Names

The presence of any of these identifiers renders health information protected health information (PHI), limiting its disclosure and use. The Safe Harbor method is cost-effective and simple but not ideal for all use cases. Sometimes it is overly restrictive, resulting in unusable de-identified data; other times it is too liberal, leaving the data vulnerable to reidentification.
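A toy sketch of Safe Harbor-style stripping; the field names cover only a subset of the 18 identifier categories and are invented for illustration:

```python
# Hypothetical field names covering a subset of the 18 Safe Harbor identifiers.
IDENTIFIER_FIELDS = {
    "name", "email", "ssn", "phone", "address", "medical_record_number",
}

def safe_harbor_deidentify(record: dict) -> dict:
    """Drop identifier fields and truncate dates to the year, per Safe Harbor."""
    out = {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}
    for field in ("admission_date", "date_of_birth"):
        if field in out:
            out[field] = out[field][:4]  # keep the year only
    return out

patient = {
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "admission_date": "2022-11-24",
    "diagnosis": "hypertension",
}
print(safe_harbor_deidentify(patient))
# → {'admission_date': '2022', 'diagnosis': 'hypertension'}
```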

Expert Determination

Expert determination applies scientific and statistical principles to data to vastly reduce the risks of re-identification. This flexible method can help customize the de-identification process to use cases, maximizing utility.

Since this approach requires statistical experts, it can be an expensive option. However, the use of quantitative methods paves the way for automation.

What are the Benefits of De-Identified Data?

Data sharing leads to better treatment and tools to improve patient care and outcomes. However, HIPAA mandates the protection of patient information, which cannot be shared without the patient’s consent or knowledge.

De-identified information can be shared with others to further medical treatment and research. De-identified data also removes certain liabilities pertaining to HIPAA violations.

De-identified data also facilitates collaboration across large data analytics platforms. For example, 14 leading healthcare companies partnered to form a new company that enhances care insights using big data analytics.

The company de-identified the information of tens of millions of patients across 40 states from thousands of care facilities to help drive analytic initiatives. Providers can share de-identified patient data to advance medical processes while protecting patient privacy and complying with HIPAA.

Unfortunately, like most disruptive technologies before them, data analytics and AI face sharp criticism from skeptics. Their applications raise privacy and ethical concerns and may lead to medical errors.

In the next section, we will discuss how AI is performing a transformative role in healthcare and how that may lead to privacy and security concerns.

Can AI Help Healthcare Data Protection Techniques?

Healthcare organizations must share clinical trial data to enable further research. At the same time, international regulatory standards like EMA 0070 and Health Canada PRCI have strict clinical trial data anonymization guidelines in place to protect patient privacy.

Healthcare providers routinely play catchup when trying to share clinical data while protecting patient privacy cost-effectively. Since clinical records comprise vast amounts of unstructured data, this manual process can take weeks or months.

Furthermore, manual processes are also prone to errors, substantially increasing the risk of data breaches that can result in hefty fines and penalties.

Medical documentation, images, videos, and research records are unstructured medical datasets that are difficult to work with. Artificial intelligence solutions like Natural Language Processing (NLP) and computer vision can help computers “understand” unstructured data such as human language, images, and videos.

The following are two of the leading challenges faced by manual anonymization and how AI can help tackle them.

Identify PII with High Accuracy

The manual process of identifying PII in clinical summaries is exceedingly time-consuming and requires formidable domain expertise. High-accuracy analytics algorithms can reduce this turnaround time by a staggering 95%, from weeks to just minutes!

Regulatory Compliance

CSR documents must adhere to rigorous regulatory standards. Risk optimization algorithms and automation can help align submissions with regulatory authorities and balance transparency with privacy.


Conclusion

This article explored the need to ensure healthcare data privacy during submissions in accordance with regulations, as well as the different types of data privacy protection techniques: redaction, anonymization, pseudonymization, and de-identification.

It also discussed the broad role that AI and ML play in transforming healthcare and the associated security and privacy concerns.

Don’t miss the next and final article in this series on data privacy in healthcare and the role of technology. We will deep-dive into AInonymize, Gramener’s data privacy protection solution that leverages AI and ML to drastically cut processing costs and increase speed and accuracy, while strictly adhering to the rigid guidelines and aggressive submission timeframes of international regulatory standards like GDPR and EMA 0070.
