Patient data is crucial to developing new treatments, new drugs, and disease-prevention strategies. In particular, genetic data combined with clinical data helps researchers understand the cause of a disease and make accurate prognoses.
These and other applications of personal and sensitive data can be valuable for researchers; working with the data itself rather than with identifiable individual patients can also help reduce bias and keep the research objective. At the same time, such data poses a substantial risk to patients' privacy: if the information finds its way to employers or insurance companies, it can influence hiring decisions or insurance pricing. Furthermore, the growing popularity of at-home genetic testing has sparked worries about the misuse of genetic information and the abuse of personal details, raising privacy issues that consumers typically do not fully understand.
Anonymizing patient data is a crucial practice in the field of medical research, helping to balance the need for privacy with the need for important research to be conducted.
The common solution when using sensitive data for research is to anonymize it so that individuals cannot be identified by the recipient of the information. This involves removing unique, personally identifiable information from the data set, such as name, email address, Social Security number, phone number, street address, and zip code.
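As a minimal sketch of this step (the field names and values below are invented for illustration), stripping direct identifiers from tabular records might look like:

```python
# Hypothetical patient records; field names and values are illustrative only.
records = [
    {"name": "Ada Lovelace", "ssn": "123-45-6789", "email": "ada@example.com",
     "zip_code": "02139", "diagnosis": "diabetes"},
    {"name": "Alan Turing", "ssn": "987-65-4321", "email": "alan@example.com",
     "zip_code": "94305", "diagnosis": "hypertension"},
]

# Direct identifiers to strip. Note that quasi-identifiers such as
# zip_code survive this step and can still enable re-identification.
DIRECT_IDENTIFIERS = {"name", "ssn", "email", "phone", "address"}

def strip_identifiers(row):
    """Return a copy of the row without any direct-identifier fields."""
    return {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}

anonymized = [strip_identifiers(r) for r in records]
print(anonymized[0])  # {'zip_code': '02139', 'diagnosis': 'diabetes'}
```

The important point, developed below, is what this sketch does not remove: columns like zip code that look harmless in isolation.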
Even if all identifiable information is removed from a data set, there is still a risk that it can be de-anonymized. De-anonymization is the reverse process: joining the anonymized data with auxiliary data in order to re-identify the individuals in the data set.
Many studies have shown that anonymization alone may be insufficient to protect data privacy. The challenge with all generic anonymization processes is that they are data dependent: the outcome depends on how unique the records remain after the identifying columns are removed, and on what other data is publicly available. In many cases it is very difficult, if not impossible, to validate that the anonymized data cannot be cross-correlated with additional data sets that would reveal personal information.
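A toy linkage attack illustrates why this is data dependent. Assuming a hypothetical anonymized medical table and a public auxiliary list such as a voter roll (all values invented), a simple join on the surviving quasi-identifiers is enough to re-attach names:

```python
# "Anonymized" medical rows: direct identifiers removed, but zip code and
# birth year remain as quasi-identifiers (all values are made up).
medical = [
    {"zip_code": "02139", "birth_year": 1815, "diagnosis": "diabetes"},
    {"zip_code": "94305", "birth_year": 1912, "diagnosis": "hypertension"},
]

# A public auxiliary data set (e.g. a voter roll) with names attached.
voter_roll = [
    {"name": "Ada Lovelace", "zip_code": "02139", "birth_year": 1815},
    {"name": "Alan Turing", "zip_code": "94305", "birth_year": 1912},
]

# Join on the quasi-identifiers: whenever the combination is unique in
# both data sets, the "anonymized" diagnosis is re-attached to a name.
reidentified = [
    {"name": v["name"], "diagnosis": m["diagnosis"]}
    for m in medical
    for v in voter_roll
    if (m["zip_code"], m["birth_year"]) == (v["zip_code"], v["birth_year"])
]
print(reidentified)
```

Whether this attack succeeds depends entirely on how unique the quasi-identifier combinations are in the population, which is exactly why no generic anonymization recipe can guarantee safety.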
Learning from another industry: the Netflix Prize was a data science competition launched by Netflix in 2006. Its goal was to improve the accuracy of Netflix's recommendation system, and to that end Netflix released an anonymized data set of users' movie ratings.
The competition was open to anyone, and tens of thousands of teams from around the world participated. One researcher who made an impact during the Netflix Prize was Arvind Narayanan, then a graduate student at the University of Texas at Austin. Narayanan took the data that Netflix released as part of the competition and joined it with public data from the Internet Movie Database (IMDb).
By combining these two data sets, matching records on which movies were rated, what ratings they received, and approximately when, the researchers were able to connect anonymous records in the Netflix Prize data set to specific individuals. The results of the study suggest that even though the Netflix Prize data set was anonymized, an adversary with access to certain kinds of external information could still re-identify individual records and potentially uncover sensitive information about those users.
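The matching idea can be sketched in miniature. This is not the researchers' actual algorithm, only a crude stand-in with invented data: score each public profile by how many of the anonymous record's ratings it agrees with on movie, score, and approximate date, then pick the best-scoring profile.

```python
# Toy data: one "anonymized" rating record and two public profiles.
# Each entry is (movie, rating, day_rated); all values are invented.
anon_record = [("Movie A", 5, 100), ("Movie B", 1, 103), ("Movie C", 4, 110)]

public_profiles = {
    "user_1": [("Movie A", 5, 101), ("Movie B", 1, 103)],
    "user_2": [("Movie A", 3, 400)],
}

def score(record, profile, date_slack=3):
    """Count ratings that agree on movie and score, with dates within a
    small window -- a crude stand-in for a real record-linkage scorer."""
    hits = 0
    for movie, rating, day in record:
        for m, r, d in profile:
            if movie == m and rating == r and abs(day - d) <= date_slack:
                hits += 1
    return hits

best = max(public_profiles, key=lambda u: score(anon_record, public_profiles[u]))
print(best)  # user_1
```

Because movie-rating histories are sparse and highly distinctive, agreement on even a handful of (movie, rating, date) triples can single out one profile, which is the intuition behind the Netflix Prize result.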
This case and related research highlight the importance of carefully considering the risk of re-identification when anonymizing data, and the need for stronger privacy techniques such as differential privacy. Even differential privacy, however, is not a generic fix: its parameters must be calibrated to the data and the queries being answered.
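As a rough sketch of the idea behind differential privacy, the Laplace mechanism adds noise, calibrated to a query's sensitivity, to the query's answer. The example below is a minimal illustration with invented data, not a production-ready implementation:

```python
import random

def dp_count(rows, predicate, epsilon):
    """Count matching rows and add Laplace noise with scale 1/epsilon.

    A count query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace(1/epsilon) noise yields
    epsilon-differential privacy. The difference of two Exponential(epsilon)
    draws is distributed as Laplace(0, 1/epsilon).
    """
    true_count = sum(1 for r in rows if predicate(r))
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Hypothetical query on invented records: how many patients have the flu?
patients = [{"condition": c} for c in ["flu", "flu", "cold", "flu"]]
noisy = dp_count(patients, lambda p: p["condition"] == "flu", epsilon=1.0)
print(round(noisy, 2))  # near the true count of 3, but randomized
```

The guarantee no longer rests on guessing what auxiliary data an adversary holds, but the noise scale (and thus the utility of the answer) still has to be tuned per query, which is why it is not a drop-in generic solution.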
Protecting patient data is critical, and anonymizing it can reduce the risk of personal information being misused. However, even anonymized data can be vulnerable to re-identification, which underscores the need for strong, rigorously evaluated privacy techniques when handling sensitive data.