The Ins and Outs of Sensitive Data Discovery

Data privacy is an increasingly hot topic all around the world. Consumers today are highly aware of the need to protect their data. Companies in all industries recognize the importance of complying with data privacy regulations and only work with partners and third parties who adhere to similar standards.

At the same time, data is the fuel that keeps every company moving. Ceasing to collect and use data is no longer an option; the only possibility is to safeguard all sensitive data. The first step requires discovering all the sensitive data that enterprises use within their solutions, so as to be able to verify that access is sufficiently restricted and the data is protected.

What is sensitive data discovery?

In simple terms, data discovery refers to the process of identifying the data that’s gathered, stored, and used across your organization’s systems. It’s usually carried out using auditing tools, which scan applications, networks, and/or endpoints for sensitive data. Companies might use data loss prevention (DLP) solutions, access brokers, or other data policy monitoring or enforcement tools.
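As a rough illustration of what such a scan does at its simplest, the sketch below searches free text for two kinds of identifiers. The pattern set and the scan_text helper are assumptions made for this example; commercial DLP and auditing tools rely on far more robust detection than two regular expressions.

```python
import re

# Illustrative patterns only (assumed for this example); real scanners
# combine many detectors, validation rules, and context checks.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_text(text):
    """Return a sorted list of (label, matched_text) pairs found in text."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group()))
    return sorted(findings)
```

Running the same function over log files, database exports, or endpoint documents would give a first, very coarse map of where identifiers appear.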

Data privacy regulations, such as GDPR and CCPA, define sensitive data as specific types of personal information. This can include:

Personal data revealing racial or ethnic origin, political opinions, and religious or philosophical beliefs;
Trade-union membership;
Genetic data, and biometric data processed solely to identify a human being;
Health-related data;
Data concerning a person’s sex life or sexual orientation.

Additionally, certain kinds of personal identifying information (PII) can be considered sensitive data, depending on the context. For example, email addresses that are connected to details about medical treatments would count as sensitive data, even though they’re not normally classified as such.
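The email example can be sketched as a simple rule: an ordinary identifier is treated as sensitive only when it sits in the same record as a special-category field. The field names and the sensitive_fields helper below are hypothetical, chosen for illustration rather than taken from any regulation or tool.

```python
# Hypothetical field labels, assumed for this example.
SPECIAL_CATEGORIES = {"health", "biometric", "genetic", "religion", "sexual_orientation"}
IDENTIFIERS = {"email", "name", "phone"}


def sensitive_fields(record_fields):
    """Return the fields of one record that should be treated as sensitive.

    Special-category fields are always sensitive. Ordinary identifiers
    become sensitive only when stored alongside a special-category field.
    """
    fields = set(record_fields)
    special = fields & SPECIAL_CATEGORIES
    if special:
        return special | (fields & IDENTIFIERS)
    return set()
```

With this rule, a record holding only an email address and a subscription plan yields nothing, while a record holding an email address and a health field flags both.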

Successful data discovery requires locating every instance of sensitive data within the organization’s storage and operational networks. It can be extremely challenging, but it’s the foundation of many crucial privacy precautions, and it facilitates data classification, which categorizes files according to their vulnerability.

Sensitive data discovery – the threat is real

In the past, some companies might have been tempted to push the complex and far-reaching issue of sensitive data discovery under the carpet. But that was never a safe gamble, and today it’s even more unwise.

Many countries and regions have laws and regulations governing the protection of sensitive data, such as GDPR in Europe and HIPAA in the United States. Ignoring these requirements can result in legal penalties, fines, and sometimes lawsuits for non-compliance. Today, many contracts between companies include clauses that assign liability and accountability for actions that result in data privacy non-compliance and/or data breaches.

The financial impact can be significant, including costs for incident response, notifying all those affected, legal fees, regulatory fines, and potentially lawsuits from data subjects affected by the breach. The liability that third parties can face for any failure to protect sensitive data can be immense, so sufficient sensitive data discovery is crucial.

What’s more, companies that ignore sensitive data are more likely to miss vulnerabilities within the system that could lead to data breaches, and overlook sensitive data which is at risk of being exposed, which can have severe consequences. Sensitive data breaches can damage the organization’s reputation and erode trust with customers, partners, and investors, sometimes for many years after the event.

Tracking who has access to sensitive data and how it is stored helps prevent many types of attacks that result in data breaches, including phishing, social engineering, malware, XSS, APTs, and more. While cybersecurity plays a key role, taking care of access to and storage of your sensitive data is equally vital. Two high-profile data breaches occurred due to information being stored in a location without sufficient security. Twitter exposed all its user passwords by placing them in an internal log without encryption, while MyHeritage compromised over 92 million user accounts when a file holding email addresses and passwords was kept on a private server.

Many breaches occur due to a hacker targeting a third party vendor, whose security profile is outside of your control. For example, Mondelez International, the parent company of Oreo and Ritz, suffered a privacy breach through a third party vendor that exposed the personal data of over 50,000 employees. Monitoring which parties have access to sensitive data could have helped prevent this incident.

Compliances and regulations that require sensitive data discovery

Besides being a vital first step in any robust data privacy policy, sensitive data discovery is also mandated by a number of different data regulations. GDPR, PCI-DSS, HIPAA, SOX, and section 170 of the UK’s Data Protection Act are among the laws and standards that require organizations to track and list the sensitive data they collect and use.

Additionally, it’s close to impossible to respond to data access requests without efficient data discovery processes. Most data privacy regulations protect the rights of data subjects to request access to the data that you collect about them, and information about how that data is used. If your data discovery system is reliable, you’ll be able to quickly reply with a comprehensive list of the data you hold and the parties that can access it.
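To make the link between discovery and access requests concrete, here is a minimal sketch that answers a request from an existing inventory. The inventory layout (name, shared_with, records) and the access_report function are assumptions for illustration, not a prescribed format.

```python
def access_report(inventory, subject_id):
    """List every dataset holding data on subject_id and who can access it.

    inventory is assumed to be a list of dicts with the keys
    "name", "shared_with", and "records" (subject id -> stored fields).
    """
    report = []
    for dataset in inventory:
        record = dataset["records"].get(subject_id)
        if record is not None:
            report.append({
                "dataset": dataset["name"],
                "fields": sorted(record),
                "shared_with": sorted(dataset["shared_with"]),
            })
    return report
```

The quality of the answer depends entirely on how complete the inventory is, which is why discovery has to come first.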

The unique use case of sensitive data

Data tracking is always a challenge, but sensitive data discovery is particularly difficult. Depending on the context, the same data can be considered “sensitive” or not sensitive, which means that it’s not possible to identify sensitive data in isolation. It’s vital to consider what other data is connected to it and when this data is utilized and accessed.

Data is acquired continuously and at high speed, making it very hard to keep up. At the same time, a typical enterprise could be working with dozens of third party vendors, and access permissions can change all the time.

Additionally, sensitive data comes in from many different sources and is stored in numerous locations and varying formats. There’s rarely any single clear owner for sensitive data, with the result that the CIO, CISO, CPO, and other stakeholders share overlapping, poorly-defined responsibilities. This inevitably leads to blind spots.

Legacy data discovery tools tend to check only part of your data ecosystem, allowing more blind spots to arise. Manual data discovery is simply unable to keep up with the challenges of tracking which sensitive data is being collected, where it is stored, the different contexts in which it is used, which third parties can access which datasets, and where there are integrations that can reach your data.

How does the process of sensitive data discovery work?

Manual data discovery processes are laborious, time-consuming, and not very reliable. The steps include:

Interviewing everyone involved in gathering data from users, customers, or partners, to find out every place where sensitive data is stored
Collecting sensitive data from all the storage locations
Entering all the sensitive datasets into a single spreadsheet, together with their location
Surveying all stakeholders to find out who has access to each dataset
Reviewing the results
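The spreadsheet produced by these steps is essentially a small inventory. A minimal sketch of that structure follows; the DatasetEntry fields, the to_csv helper, and the example location string are hypothetical.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One row of a manually maintained sensitive-data inventory."""
    name: str
    location: str
    categories: list
    accessed_by: list = field(default_factory=list)


def to_csv(entries):
    """Serialize inventory entries to CSV text, one dataset per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["dataset", "location", "categories", "accessed_by"])
    for entry in entries:
        writer.writerow([
            entry.name,
            entry.location,
            ";".join(entry.categories),
            ";".join(entry.accessed_by),
        ])
    return buffer.getvalue()
```

Every value in such a file comes from interviews and surveys, so it starts going stale as soon as a new dataset or vendor is added.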

Recently, companies began automating the above steps. This speeds the process up a great deal, making it possible to keep up with the volume of data coming in, but it isn’t able to overcome the challenges of understanding the context that defines sensitive data. Human input is still required to confirm context-driven sensitive data and connect all the dots, and tools need to be integrated into the production environment, with all the attendant drawbacks and difficulties.

A newer approach uses artificial intelligence (AI) to scan code for sensitive data. Unlike manual or automated traditional sensitive data discovery, AI code scanning is accurate, continuous, and aware of the context that defines sensitive data. It delivers alerts whenever code deviates from best practices of sensitive data management. Code scanning takes place in dev environments, rather than in production, which makes both review and any fixes much faster, cheaper, and less effort-intensive.

Topic	Manual sensitive data discovery	Automated database scanning	AI code scanning for sensitive data discovery
Time and effort	Slow and laborious	Saves time and energy but still some manual work	Very fast and automatic
Third-party vendor	Manual	Manual	Automatic
Data lineage	Manual	Manual	Automatic
AI Data Privacy	Manual	Manual	Automatic
Context provided?	Provided manually by the auditor	Context set during setup hence it is static	Dynamic discovery of data context
Incident response	Manual	Manual	Automatic
Frequency	Intermittent	Intermittent	Continuous
Data flow detection	Manual	Manual	Automatic
Remediation process	Long, tedious, and risky. Requires returning to production to fix the code	Long, tedious, and risky. Requires returning to production to fix the code	Quick and painless, since it’s carried out while still in development
Access to databases	Already granted to relevant employees	Needs to be given to the third party tool	Not needed

Which tools can help in the sensitive data discovery process?

As you can see, manual data discovery is not practical. There are many tools and services to help organizations automate and streamline their data discovery.

The benefits of using data discovery tools

Data discovery tools can automate the entire process, thereby reducing the risk of human errors and saving you both time and money. They are easier to use than any manual data discovery process, and more consistent. When you rely on individual employees to carry out data discovery, you’ll be left at sea when they retire or move to a different company, taking with them vital knowledge about data collection and storage.

The right tools are more accurate and reliable than manual data discovery, and deliver valuable context which manual processes tend to leave out. Context, such as which other data is stored alongside a given dataset, can help reveal which data should be prioritized and where vulnerabilities lie in your ecosystem.

Types of data discovery tools

Today, data-heavy enterprises can choose from a number of different types of data discovery tools. These include database scanners and storage scanners, which sift through all the data in all your databases or storage locations to identify all the data you collect, where it’s located, and which parties can access it.

Third party scanners take a slightly different approach. Instead of surveying all the data you collect, they search for data that third parties can access. These scanners list all the access credentials for your datasets, helping reveal which data is at greater risk of a breach.
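The core of that approach can be sketched as filtering a list of access grants down to external principals. The grant format (principal, dataset) and the use of email domains to tell internal from external parties are assumptions made for this example.

```python
from collections import defaultdict


def external_access(grants, internal_domains):
    """Map each dataset to the external principals that can access it.

    grants is assumed to be an iterable of (principal, dataset) pairs,
    where a principal looks like "user@domain".
    """
    exposure = defaultdict(set)
    for principal, dataset in grants:
        domain = principal.rsplit("@", 1)[-1]
        if domain not in internal_domains:
            exposure[dataset].add(principal)
    return {dataset: sorted(principals) for dataset, principals in exposure.items()}
```

Datasets that never appear in the result have no external grants in the input, which says nothing about internal access or data flow.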

Last but not least, there are code scanners. These are lightweight tools that use natural language processing (NLP) together with large language models (LLMs), both of which are forms of artificial intelligence (AI), to examine code for instances of data usage and access. They produce a list of all the third parties that can access your data and which data you share with them, deliver information about data flow and context, and reveal where sensitive data is being stored within your ecosystem.
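To show what examining code rather than data means in practice, here is a deliberately simplified sketch that walks a Python syntax tree and flags identifiers whose names suggest sensitive data. It is a keyword heuristic, not the NLP or LLM analysis described above, and the keyword list and function name are assumptions for illustration.

```python
import ast

# Assumed keyword list; an AI-based scanner would infer meaning from
# context instead of matching substrings.
SENSITIVE_KEYWORDS = ("email", "password", "ssn", "diagnosis")


def find_sensitive_identifiers(source):
    """Return sorted (line_number, identifier) pairs that look sensitive."""
    findings = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            identifier = node.id
        elif isinstance(node, ast.Attribute):
            identifier = node.attr
        elif isinstance(node, ast.arg):
            identifier = node.arg
        else:
            continue
        if any(keyword in identifier.lower() for keyword in SENSITIVE_KEYWORDS):
            findings.add((node.lineno, identifier))
    return sorted(findings)
```

Because the input is source code, a check like this needs no database credentials and can run before the code is deployed.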

The drawbacks of traditional data scanning tools

The many data scanning tools and services out there were not all created equal. Traditional options like database scanners are expensive to deploy and complicated to install and integrate into your systems. To run them effectively, you need to invest time and effort defining the tools’ access to your databases and parameters like the user and timing of the scan.

These tools have to be deployed in production, which could impact performance and service. Deployment also requires navigating a complex system of network segmentation, security restrictions, and access management, to ensure that the tool can reach all data no matter where it’s located. Last but not least, granting permission for yet another tool to access sensitive data is itself a liability risk.

While third party scanners are quicker and easier to deploy, they are also less comprehensive. These scanners only reveal which data is accessible to external partners or vendors, but they don’t give much information about internal access, context, or data flow, or sensitive data that is currently going unused in an insecure storage location.

Let AI classify and discover sensitive data – it starts from the code

In contrast, newer code scanners that use AI, LLMs, and NLP are far more trustworthy and easy to use. Unlike other tools, they don’t require access to databases or storage. By scanning code before it reaches production, they can extract information about which data is stored, storage locations, third party access, and more, all without affecting production.

With AI code scanners, you can see the full context, flow, and entities of sensitive PII data, and track code contributors to increase accountability and build an audit trail. Code scanners help you find sensitive data early, saving time and money and reducing the risk of data breaches. They are also easy to integrate, low cost, and agentless.

Discover Privya’s low-cost, easy to use, reliable AI code scanner today

Uzy Hadad

CEO