ChatGPT burst upon the world like a miracle tool, reaching 100 million active users within two months of its release. Many people are excited about the possibilities it represents and the time it can save them on tasks like writing high school essays, checking code for bugs, and more.
ChatGPT is a machine learning (ML) language model built by the research organization OpenAI, and it’s vastly more powerful than any model that has come before. OpenAI recently released GPT-4, a new version that’s even more powerful than its predecessors.
The model was trained on more than 500 billion tokens drawn from webpages, documents, social media threads, blog posts, ebooks, news media outlets, and other data sources, and comprises approximately 175 billion parameters.
But for data privacy professionals, this combination of immense complexity and massive training datasets creates the potential for a data privacy nightmare. Recent news about a bug in ChatGPT’s open source library that leaked parts of users’ conversation histories only added to the sense of impending privacy problems.
There are a number of generally agreed-upon privacy principles that form the backbone of most data privacy regulations, including the EU’s General Data Protection Regulation (GDPR) and California’s CCPA and CPRA. They include lawfulness, such as obtaining informed consent from data subjects; maintaining accurate data; limits on how long data can be retained; transparency over how data is used; and more. The growing use of OpenAI’s ChatGPT raises concerns regarding many of these principles.
Lawfulness and informed consent

Possibly the most fundamental privacy principle is lawfulness. When an organization collects data from someone, it must obtain that person’s informed consent, which requires explaining why the data is being gathered, how it will be used, and who will have access to it. ChatGPT was trained on billions of data points scraped from the internet. When consumers agreed to share their data with those platforms, they gave consent for a specific purpose, such as accessing news articles or sending photos to friends. They didn’t agree to OpenAI using it to train an AI language model.
According to internet security experts, OpenAI could be in breach of data privacy regulations. “If OpenAI obtained its training data through trawling the internet, it’s unlawful,” says Alexander Hanff, a member of the European Data Protection Board’s (EDPB) support pool of experts. “Scraping billions or trillions of data points from sites with terms and conditions which, in themselves, said that the data couldn’t be scraped by a third party, is a breach of the contract.”
A similar issue applies to the ebooks and articles that OpenAI drew on. When someone publishes a book or blog post online, they retain rights to the content. By ingesting that content as training data, ChatGPT used it for a purpose the copyright holder never authorized, in contravention of copyright law.
Additionally, ChatGPT may still be unlawfully acquiring and using personal data. The platform stores users’ IP addresses, browser types and settings, email addresses, and phone numbers, as well as the prompts they write, the content they engage with, the features they use, and their browsing activity. However, there is no consent form, data is not anonymized, and no limits are set on how long data can be stored. OpenAI also states that it may share users’ personal information with unspecified third parties without their prior consent.
ChatGPT recently updated its user warnings to request that people don’t include sensitive information in the prompts they write, but it’s unclear whether this is sufficient for data privacy purposes. And the recent leak only emphasizes that this is a real issue.
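Until stronger guarantees exist, one practical mitigation on the user side is to redact obvious identifiers before a prompt ever leaves the machine. A minimal sketch of that idea (the regex patterns and the `scrub_prompt` helper are illustrative assumptions, not an official OpenAI feature, and real PII detection needs far broader coverage):

```python
import re

# Illustrative patterns only; they catch common email/phone formats,
# not every way personal data can appear in a prompt.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_prompt(prompt: str) -> str:
    """Replace obvious email addresses and phone numbers with placeholders
    before the prompt is sent to a third-party service."""
    prompt = EMAIL.sub("[EMAIL]", prompt)
    prompt = PHONE.sub("[PHONE]", prompt)
    return prompt

print(scrub_prompt("Contact jane.doe@example.com or +1 (555) 123-4567."))
# → Contact [EMAIL] or [PHONE].
```

A redaction pass like this reduces, but does not eliminate, the risk that sensitive details end up in a provider’s logs or training data.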
The right to be forgotten
Another serious privacy issue that ChatGPT disregards is the requirement that data be kept only for a limited amount of time. Once a company no longer needs the data for the purpose for which it was collected, it must delete the data. As mentioned above, OpenAI gives no timeframe for how long it will keep data, whether that data derives from user prompts or from training data scraped from the internet.
Furthermore, GDPR and most privacy regulations include the “right to be forgotten”: any user can request, at any time, that a company, media platform, or any online presence delete their information entirely. Because ChatGPT’s training corpus comprises billions of data points, it is close to impossible to delete a single individual’s information. It’s like removing a needle from a haystack, only the haystack is several miles high and you have to remove a hundred needles, one at a time.
Data accuracy

Another privacy principle is that stored data must be accurate, and be used to deliver accurate information. ChatGPT is already notorious for its lack of accuracy, with journalists using the platform to “prove” that the earth is flat. Its training data only extends to 2021, so even a simple question like “Who is the Prime Minister of the UK?” can generate a wildly inaccurate answer.
Those are innocuous examples. ChatGPT might also produce false information about a company, a political party, an ethnic group, or an individual, which could spread around the world before it can be fact-checked and corrected. This lack of accuracy could be a serious issue, particularly on sensitive topics like race relations, climate change, or political disputes. ChatGPT could write texts accusing people – even ordinary, everyday users – of being criminals, which could damage their reputation and cause unnecessary emotional distress.
AI natural language processing models can also produce hallucinations, meaning they generate content that’s nonsensical or unfaithful to the source content. Hallucinations can occur for a number of different reasons, including overconfidence in the system’s own hard-wired parametric knowledge. ChatGPT is at further risk of a cascade of hallucinations, because it uses its own previously-generated content to inform the next sequence of words. If the content is convincing, this could lead to mass deception as users share the content on other platforms without context. ChatGPT users need to be aware of this risk.
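The cascade risk comes from the autoregressive loop itself: each generated token is appended to the context and conditions everything that follows, so an early fabrication is never re-checked. A toy sketch of that loop (the deterministic bigram `MODEL` is invented purely for illustration; real language models predict probability distributions over huge vocabularies):

```python
# Toy bigram "language model": each word deterministically suggests the next.
# One wrong association early on steers every later word.
MODEL = {
    "the": "earth",
    "earth": "is",
    "is": "flat",      # a baked-in error in the "training data"
    "flat": "because",
    "because": "maps",
}

def generate(prompt: str, steps: int) -> str:
    """Autoregressive generation: the model's own output becomes its next input."""
    words = prompt.split()
    for _ in range(steps):
        nxt = MODEL.get(words[-1])
        if nxt is None:
            break  # no continuation known for the last word
        words.append(nxt)
    return " ".join(words)

print(generate("the", 5))  # → "the earth is flat because maps"
```

Once the erroneous “flat” is emitted, every subsequent word is conditioned on it, which is the mechanism behind a cascade of hallucinations.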
On top of that, there’s all the data that ChatGPT is gathering from user prompts. This might include intimate and embarrassing details about people’s lives, some of which may be imaginary, theoretical, or made up out of malice. The algorithm doesn’t verify whether data is true, and there are no safeguards to prevent it from revealing that information in answer to a different prompt.
Transparency

Organizations that collect and use personal data are also required to be transparent about how they use it. At the same time, a growing body of legislation obligates artificial intelligence (AI) algorithms to be transparent about how they arrive at their decisions and what informs them.
ChatGPT, however, is so complex that it’s extremely difficult to understand how it arrives at a given decision, or in this case, piece of text. There is no transparency to how it uses data to produce new text, or the basis for any information that it shares as fact.
Bias

Finally, bias is a common problem for ML models. Organizations that run algorithms need to discover when and where bias appears in the results they deliver, so that they can correct for it and eliminate it as far as possible. For example, many ML models exhibit bias based on gender, race, or geographic location.
ML models are only as good as the data they are trained on. As mentioned above, ChatGPT draws on so many datasets that it inevitably consumed outdated, discriminatory, or otherwise biased viewpoints, which can then surface in the texts it generates. OpenAI has taken steps to prevent ChatGPT from delivering racist, sexist, or other types of offensive statements, but with such vast training data and so many parameters, it’s difficult to guarantee that none will appear.
New legislation could raise further problems for ChatGPT
ChatGPT could soon be subject to even more legislation, in the form of the EU’s upcoming Artificial Intelligence Act, or AI Act, which was proposed in April 2021 and is due to go to a vote sometime in 2023.
The AI Act would classify AI applications by risk level and apply regulatory frameworks accordingly. It would prohibit some types of AI applications entirely, such as social scoring, and place extensive requirements on high-risk applications, while low-risk ones, such as spam filters and chatbots, would face close to zero regulation. It remains to be seen how the AI Act will affect ChatGPT’s use of data.
ChatGPT: A force for the good, and a data privacy minefield
The conversation about ChatGPT’s need to comply with data protection laws is still ongoing, but as the initial shine wears off, you can expect to see more people voicing concerns around its data privacy implications.