Social science research often involves collecting data through direct interactions with individuals, households, and communities. This data generally includes detailed personal and sensitive information. Additionally, data may be collected from marginalized and vulnerable communities, including those with a limited understanding of their rights as data subjects. Sensitive personal data intended for statistical analysis can be misused or cause harm to households and communities. In one recent incident, a United Nations agency inappropriately collected personal information from an ethnic minority community and shared it with governments, which then used the data to verify households for potential repatriation.[1]
Given these risks, it is essential to handle identifiable data with the utmost care to protect participant privacy and prevent harm to respondents. The Belmont Report emphasizes protecting participants and minimizing harm, entrusting researchers with an ethical obligation and a professional responsibility to store personally identifiable information securely and maintain confidentiality.
As a best practice, researchers use data de-identification to safeguard the ‘right to privacy’ of participants and ensure compliance with ethical research norms. In this blog, we discuss data de-identification and ways to de-identify data. Data de-identification is just one part of ethically using data – see 3ie’s Transparency work and our policy on Transparent, Replicable, and Ethical Evidence to learn more.
Understanding data de-identification
Data de-identification allows researchers to protect participant privacy while maintaining the analytical value of datasets for program implementers and decision-makers. The process involves systematically removing or altering personally identifiable information (PII) from datasets to prevent the identification of individual participants while preserving the essential data points needed for statistical analysis.
The de-identification process requires carefully identifying direct and indirect identifiers and removing or masking them without losing important data points. Direct identifiers, such as names or contact details, can be used to identify respondents directly. Indirect identifiers, such as age and education, could be combined with other data to identify respondents.
The objective of data de-identification is to strike a balance between protecting participant privacy and maintaining data utility, ensuring that researchers can continue to generate useful analyses without compromising the safety of study participants. This can be done by:
- Retaining identifiers for certain types of analysis, especially for cases involving long-term follow-up of respondents.
- Preserving data by maintaining both identified and de-identified datasets, limiting access to the identified data, and using de-identified data for all analyses where identifiers are unnecessary.
- Making only the de-identified datasets publicly accessible once the project is completed, to promote transparency and knowledge sharing.
- Securely storing data containing personally identifiable information to safeguard participants’ interests.
This dual approach allows for robust analysis while maintaining the highest standards of ethical research conduct and participant protection.
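To make the dual-dataset approach concrete, here is a minimal sketch in Python using pandas. The column names, toy values, and file organization are illustrative assumptions, not 3ie's actual workflow: a random household ID links a restricted identified file to the de-identified file used for analysis.

```python
# A minimal sketch of the dual-dataset approach. A random household ID
# links the restricted linking file to the de-identified analysis file.
# Column names and toy values are illustrative.
import uuid
import pandas as pd

survey = pd.DataFrame({
    "name":    ["A. Rahman", "B. Devi"],
    "phone":   ["555-0101", "555-0102"],
    "village": ["North", "South"],
    "income":  [12000, 15500],
})

# Random IDs carry no information about the respondent.
survey["hh_id"] = [uuid.uuid4().hex for _ in range(len(survey))]

# Linking file: key + direct identifiers. Store separately, restrict access.
linking_file = survey[["hh_id", "name", "phone"]]

# De-identified file: everything needed for analysis, no direct identifiers.
deidentified = survey.drop(columns=["name", "phone"])
print(deidentified)
```

Because the household ID is random, publishing the de-identified file reveals nothing about identities unless someone also obtains the separately stored, access-restricted linking file.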
What should be de-identified?
Effective de-identification is not just about removing direct identifiers such as names and addresses. It also requires carefully identifying and masking indirect identifiers that could be combined with other information to re-identify individuals. Indirect identifiers such as age, location, occupation, income, and household composition may seem harmless individually, but they can create unique profiles when combined with other information from the dataset or from secondary data. For example, a 40-year-old female teacher in a specific rural location with three children might be easily identifiable in a small community even if her name is not mentioned. Beyond direct and indirect identifiers, sensitive variables such as health conditions, political affiliations, criminal history, and religious beliefs require special protection because they pose privacy risks and could be used to discriminate. The key principle is to assess combinations of seemingly independent data points that, taken together, could compromise participant anonymity.
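Risks like the teacher example can be checked mechanically before release. Below is a minimal k-anonymity-style sketch in Python (pandas); the quasi-identifier list and the threshold k = 5 are illustrative assumptions, and the toy data stands in for a real survey.

```python
# A minimal k-anonymity-style check: flag combinations of indirect
# identifiers shared by fewer than k respondents. The quasi-identifier
# list and the k = 5 threshold are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "age":        [40, 35, 35, 35, 40],
    "sex":        ["F", "M", "M", "M", "F"],
    "occupation": ["teacher", "farmer", "farmer", "farmer", "teacher"],
    "district":   ["Rural A", "Rural B", "Rural B", "Rural B", "Rural A"],
})

quasi_identifiers = ["age", "sex", "occupation", "district"]
k = 5  # minimum acceptable group size for each combination

group_sizes = df.groupby(quasi_identifiers).size()
risky = group_sizes[group_sizes < k]
print(risky)  # combinations below k need further masking or aggregation
```

Any combination flagged here is a candidate for the masking, aggregation, or suppression methods discussed next.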
Methods for de-identification
Depending on the context, effective de-identification can be achieved through several methods, each illustrated with a short code sketch below:
- Data masking: Hiding or modifying specific data elements. Names, contact details, and locations are direct identifiers and should never be published. Instead, analysts should publish data with masked identifiers that can still be linked across repeated observations over time. Example: changing a birth date from "1980-05-14" to just "1980," or creating a unique ID for each household. A limitation of this approach is that masking may reduce the precision and utility of the dataset, potentially affecting certain types of analyses. Also, if not done carefully, masked data can still be vulnerable to re-identification when combined with other datasets.
- Data aggregation: Combining data at a higher level to make it less specific and to prevent users from ascertaining the location of any individual household. Example: instead of reporting data by individual zip codes, report it by region or county. A limitation of this approach is that excessive aggregation may limit the usefulness of the dataset for detailed policy or program evaluations, as it can mask important variation within subgroups, reducing the granularity of analysis.
- Suppression of small cell counts: Not reporting data when the number of individuals in a category is too small (e.g., fewer than five). This prevents someone from identifying those individuals. Example: If only three people in a town have a rare disease, don’t report that specific number to avoid singling them out. This method offers protection for individuals in rare or small subgroups, minimizing disclosure risks. However, extensive suppression may lead to biased results by disproportionately excluding small but potentially important subpopulations.
- Noise addition: Introducing controlled statistical noise to numerical data to prevent exact matching while maintaining overall patterns and relationships. This approach allows data to be shared while retaining analytical value at an aggregate level. However, excessive noise limits accuracy and reduces the reliability of results, and poorly designed noise-addition methods can sometimes be reversed, increasing the risk of re-identification.
- Rounding-off: Reporting rounded numbers to prevent re-identification. Example: rounding income to the nearest 100 or 1,000. A limitation of this approach is that rounding can introduce measurement error, especially for small sample sizes or variables with low variance.

Whichever methods you use, keep detailed records of all de-identification procedures, and store linking files separately with restricted access in case data updates or corrections are needed later.
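The sketches below illustrate each method in turn, using Python with pandas and numpy. All column names, toy values, thresholds, and parameters are illustrative assumptions, not recommendations. First, data masking: replacing names with salted-hash pseudonyms (so the same household keeps the same ID across survey rounds) and coarsening a birth date to its year. The salt shown is a placeholder.

```python
# A minimal data-masking sketch: pseudonymous IDs plus date coarsening.
import hashlib
import pandas as pd

df = pd.DataFrame({
    "name":       ["A. Rahman", "B. Devi"],
    "birth_date": ["1980-05-14", "1975-11-02"],
})

SALT = "project-secret"  # illustrative; store securely and never publish

def mask_id(value: str) -> str:
    # Deterministic: the same name maps to the same ID in every round,
    # so masked records can be linked across repeated observations.
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:10]

df["hh_id"] = df["name"].apply(mask_id)
df["birth_year"] = pd.to_datetime(df["birth_date"]).dt.year
df = df.drop(columns=["name", "birth_date"])
print(df)
```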
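Next, data aggregation: a sketch of reporting district-level averages instead of household-level values, under the same toy-data assumptions.

```python
# A minimal aggregation sketch: household incomes released as district means.
import pandas as pd

df = pd.DataFrame({
    "district": ["North", "North", "South", "South"],
    "income":   [12000, 15500, 9800, 14200],
})

# Individual households are no longer distinguishable in the output.
aggregated = df.groupby("district", as_index=False)["income"].mean()
print(aggregated)
```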
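For suppression of small cell counts, a sketch that replaces any count below the threshold with a suppression flag before release; the threshold of five follows the example above.

```python
# A minimal small-cell suppression sketch: counts below 5 are not released.
import pandas as pd

counts = pd.DataFrame({
    "town":  ["Alpha", "Beta", "Gamma"],
    "cases": [120, 3, 47],
})

THRESHOLD = 5
# Copy to an object column so the suppression flag can replace small counts.
counts["cases_published"] = counts["cases"].astype("object")
counts.loc[counts["cases"] < THRESHOLD, "cases_published"] = "<5"
print(counts[["town", "cases_published"]])
```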
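For noise addition, a sketch adding zero-mean Gaussian noise scaled to a small fraction of the variable's standard deviation. The 5% scale and the fixed seed are illustrative choices only; in practice the scale should be tuned to the re-identification risk and the analyses the data must support.

```python
# A minimal noise-addition sketch: zero-mean Gaussian noise on income.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed for reproducibility

df = pd.DataFrame({"income": [12000.0, 15500.0, 9800.0, 14200.0]})
scale = 0.05 * df["income"].std()  # noise scale: 5% of the std. deviation
df["income_noised"] = df["income"] + rng.normal(0.0, scale, size=len(df))
print(df)
```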
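Finally, rounding-off: a sketch rounding income to the nearest 1,000, again on hypothetical values.

```python
# A minimal rounding sketch: income rounded to the nearest 1,000.
import pandas as pd

df = pd.DataFrame({"income": [12340, 15675, 9820]})
df["income_rounded"] = (df["income"] / 1000).round().astype(int) * 1000
print(df)
```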
This blog is the second in a series of blogs where we share our experiences promoting open science best practices and ethical data management as part of the TREE initiative at 3ie. The first blog of this series discussed creating datasets that ensure computational reproducibility. In the next part, we will discuss data storage and archival processes in detail.
[1] https://www.hrw.org/news/2021/06/15/un-shared-rohingya-data-without-informed-consent