What is data lineage and how is it changing data security?
Data lineage, also known as data provenance or data tracing, is the process of tracking data as it moves within an organization to understand its origins, how it has been modified, and who is using it and how. These are effectively the “What, Where, Who, and Why” of the data being created, modified, and shared within your organization. This added context allows security teams to better protect data from theft and misuse.
Traditional data protection technology classifies sensitive data by matching patterns in the content (such as regular expressions and keywords), by user-applied tags, and by fingerprinting, all of which cover a limited range of data types. Data lineage is an entirely new way to classify sensitive data, one that covers more data types while reducing false positives. It has substantial implications for how companies identify, investigate, and report on data security risks and incidents.
How data lineage is transforming data classification
Until now, many data security products operated like the scanners that passengers encounter at security checkpoints in U.S. airports. Tests by the U.S. government using mock guns, knives, and explosives found that airport scanners fail to detect over 70% of contraband material. Conventional data loss prevention (DLP) operates in a similar way: it scans content when you send an email or upload a file, but it relies on brittle indicators that do not accurately capture risk.
It turns out that scanning content on its way out of the company is not very accurate at flagging important information, so a lot of it can slip past these tools. Data lineage adds an important piece of context: the history of the data. Returning to the airport example, data lineage is like putting an Apple AirTag on every object in the world. If an object that originated in an explosives factory heads to the airport, it should not be allowed on an airplane. Similarly, data from Salesforce should not be uploaded to an employee’s personal Dropbox.
Beyond the content, you can learn a lot about data based on its lineage:
- Where it originated: Whether a customer database in Snowflake, the source code repository in GitHub, or the product design in Figma, different types of data originate in different places.
- How it was handled: Data moves in recognizable ways, passing through the board meeting site in SharePoint, the client documents folder in Google Drive, or the employee offer letter account in DocuSign.
- Who added to it: Different employees produce different work, from researchers who develop drug formulas, to designers working on new products, to accountants who compile financial results.
This type of visibility is important for making informed decisions about specific incidents. For example, knowing that an employee changed the file extension of a spreadsheet containing customer data from .xlsx to .png before attempting to upload it to a personal datastore helps you determine whether they are acting maliciously or simply making a mistake. At a broader level, data lineage can show you how employees carry out business functions in their normal day-to-day work. This can tell you, for example, whether IT needs to invest in new tools or whether new processes should be put in place to make work more efficient.
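As a rough illustration of the file extension example above, the following sketch checks whether a rename that changed a file's extension came before an upload of the renamed file. The event dictionaries and their keys (`action`, `old`, `new`, `file`) are hypothetical and not taken from any particular product.

```python
from pathlib import PurePosixPath


def extension_changed(old_name, new_name):
    """True if the base name is unchanged but the extension differs."""
    old = PurePosixPath(old_name)
    new = PurePosixPath(new_name)
    return old.stem == new.stem and old.suffix.lower() != new.suffix.lower()


def renamed_before_upload(events):
    """True if an extension-changing rename precedes an upload of that file.

    `events` is assumed to be ordered oldest first.
    """
    renamed = set()
    for event in events:
        if event["action"] == "rename" and extension_changed(event["old"], event["new"]):
            renamed.add(event["new"])
        elif event["action"] == "upload" and event["file"] in renamed:
            return True
    return False


# Hypothetical sample history for one spreadsheet.
history = [
    {"action": "rename", "old": "customers.xlsx", "new": "customers.png"},
    {"action": "upload", "file": "customers.png"},
]
suspicious = renamed_before_upload(history)
```

A signal like this would not prove intent on its own, but it is the kind of context an analyst could weigh alongside the rest of the lineage.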
How does data lineage work?
Data lineage can use endpoint telemetry, browser plugins, and cloud APIs to collect information on events related to each piece of data including:
- Actions: copy, paste, modify, etc.
- Location: application, website/domain, machine hostname, etc.
- User: username, directory group, departure date, etc.
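One way to picture the events listed above is as small records that can be grouped by the piece of data they refer to. This is a minimal sketch, not a description of any real collector: the `LineageEvent` class, its field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    data_id: str         # hypothetical identifier for one piece of data
    action: str          # e.g. "copy", "paste", "modify"
    location: str        # application, website/domain, or machine hostname
    user: str            # username of the person who performed the action
    timestamp: datetime  # when the action happened


def events_for(data_id, events):
    """Return the events recorded for one piece of data, oldest first."""
    matching = (e for e in events if e.data_id == data_id)
    return sorted(matching, key=lambda e: e.timestamp)


# Hypothetical sample events, deliberately out of order.
events = [
    LineageEvent("doc-1", "modify", "excel.exe", "alice",
                 datetime(2024, 1, 2, 9, 30, tzinfo=timezone.utc)),
    LineageEvent("doc-1", "copy", "snowflake.example.com", "alice",
                 datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)),
    LineageEvent("doc-2", "paste", "slack.exe", "bob",
                 datetime(2024, 1, 2, 9, 15, tzinfo=timezone.utc)),
]
history = events_for("doc-1", events)
```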
Today, most organizations have billions of unique pieces of data, with even more events when you count every time a piece of data is shared, or moves to a new location, or is modified in some way. Because it’s your data that ultimately matters (not just files), data lineage technology is designed to track each data-related action for each piece of sensitive data even as it’s copied or moved between applications and files, providing a more granular view of data movement.
A data lineage is calculated by loading these billions of events into a graph database and linking related events together. Graph databases have traditionally been used in a variety of areas such as social media, fraud detection, and customer sentiment analysis. They allow for the computation of relationships between different pieces of data. Consider the problem of determining the degrees of separation between any two accounts on a social media platform.
LinkedIn, for instance, uses graph databases to tell you whether the users you are interacting with are 1st, 2nd, or 3rd degree connections. Recent advances in graph database technology have made it possible to calculate connections that are many steps removed from one another, the equivalent of 100th degree connections on a social network. This innovation makes data lineage possible because data often takes a long journey within the company, with dozens or even hundreds of events occurring as it is moved, shared, and edited.
Because this computation is performed in the cloud, it can be carried out quickly without introducing noticeable latency for end users on endpoints or elsewhere.
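The tracing step described above can be sketched as a breadth-first walk backwards through "derived from" edges until nodes with no parents (the origins) are reached. This is a toy, in-memory illustration under stated assumptions: the edge map, node names, and `trace_origins` function are hypothetical, and a production system would run such a traversal inside a graph database rather than a Python dictionary.

```python
from collections import deque

# Hypothetical edge map: child -> list of (parent, action that produced the child).
derived_from = {
    "dropbox:upload.zip": [("laptop:report.png", "compress")],
    "laptop:report.png": [("laptop:report.xlsx", "rename")],
    "laptop:report.xlsx": [("salesforce:accounts-export", "download")],
}


def trace_origins(node, edges):
    """Walk edges backwards breadth-first.

    Returns a dict mapping each origin (a node with no recorded parent)
    to the number of hops between it and the starting node.
    """
    seen = {node}
    queue = deque([(node, 0)])
    origins = {}
    while queue:
        current, depth = queue.popleft()
        parents = edges.get(current, [])
        if not parents:
            origins[current] = depth
        for parent, _action in parents:
            if parent not in seen:  # guard against cycles and repeat visits
                seen.add(parent)
                queue.append((parent, depth + 1))
    return origins


origins = trace_origins("dropbox:upload.zip", derived_from)
```

In this sample, the uploaded archive traces back three hops to a Salesforce export, which is the kind of connection that content scanning alone would miss once the file has been renamed and compressed.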
Why data lineage is now essential
Data lineage cannot be understood without appreciating the subtle transformation that data security has undergone over the last decade. With the perimeter breaking down, data security is less about blocking anything from ever leaving your corporate intranet or servers, and more about protecting data across your extended enterprise in the cloud. Today, policies have to be nuanced to properly address risk. For example, perhaps your organization allows customer data in Google Drive, but if the permissions on a document containing customer data are too broad, that would constitute a policy violation.
This makes traditional tools like legacy Data Loss Prevention (DLP), which relies on fingerprinting, content inspection, tagging, and classification, ill-suited for modern environments, even when they are accurate. Additional pressure from regulations like GDPR and CCPA also requires that organizations have better visibility into the lifecycle of their data, another area where legacy tools struggle. There are two key implications: data lineage greatly increases the visibility and context that security analysts have when identifying data exposure risks, and this in turn increases the speed and efficacy of investigations into such incidents.
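The Google Drive example above can be expressed as a small rule that combines where data came from with how it is currently shared. The following is a minimal sketch under assumptions: the `StoredCopy` class, the sharing levels, and the set of origins treated as customer data are all hypothetical, and the rule covers only the Google Drive case from the example.

```python
from dataclasses import dataclass

# Assumptions for illustration only.
CUSTOMER_ORIGINS = {"salesforce", "snowflake"}  # origins treated as customer data
BROAD_SHARING = {"company", "public"}           # sharing levels considered too broad


@dataclass
class StoredCopy:
    origin: str    # where lineage says the data originated
    location: str  # where this copy lives now
    sharing: str   # "private", "team", "company", or "public"


def violates_drive_policy(copy):
    """Customer data is allowed in Google Drive unless it is shared too broadly."""
    if copy.origin not in CUSTOMER_ORIGINS:
        return False
    if copy.location != "google_drive":
        return False  # other locations are outside this sketch
    return copy.sharing in BROAD_SHARING


team_copy = StoredCopy("salesforce", "google_drive", "team")
public_copy = StoredCopy("salesforce", "google_drive", "public")
```

Here `team_copy` would pass and `public_copy` would be flagged, even though both contain the same content.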
Getting visibility into how data flows
In many instances, IT and security teams face unknown unknowns. That is to say, it’s hard for them to know what shape policy violations take in the absence of context and visibility. Data lineage plays a crucial role in providing both. Being able to visualize how data is flowing, from origination to its current destination, gives security teams the capability to tackle data sprawl, which is a significant problem for organizations.
Data sprawl emerges when files with sensitive content are replicated and moved into new environments where access permissions may be different. Think about an HR employee downloading a report with employee names and salaries from Workday and then uploading that file to a Google Drive folder accessible to every employee in the company.
Data lineage helps you get ahead of this by providing greater visibility into how sensitive data moves within the company, including where the data is stored and who has copies of it, so that the appropriate controls can be put in place. It also provides insights into user behavior, such as how employees are interacting with sensitive data, what types of unsanctioned apps (if any) they might be using, and more. All of this is invaluable not just for remediating data security incidents, but also for informing IT and security teams about substantive changes they can make to enable employees and make their lives easier. For example, if employees use free (and risky) tools to convert files to PDF, you can instead invest in a secure enterprise application that provides this functionality.
Enabling and accelerating security investigations
There are more security alerts than ever going to SOC teams. Data lineage helps by reducing the number of false positives, so analysts can focus on real incidents. But it also helps by reducing the time it takes to investigate an alert by providing analysts with more context up front to quickly understand user intent and take appropriate action.
Traditionally, data loss prevention and insider threat tools provided analysts with only enough information to know what incident took place and what immediate action triggered the alert. Any additional context, like events that might have preceded the incident, was outside the scope of the alert. Having this additional context and associating it with the incident is what enables analysts to truly understand an incident. This can help determine if an employee made a one-off mistake or if they’ve engaged in a malicious pattern of behavior to steal sensitive data.
Consider, for example, an employee triggering an alert after uploading a product design file to their personal Dropbox account. To understand whether this was just a careless employee who didn’t know the policy, or someone intentionally trying to steal data, you need to see more of the events surrounding the incident. If, after being alerted that they can’t upload the file to Dropbox, the user then placed the file in an encrypted .zip archive and sent it as an attachment via WeChat, it’s clear their behavior is deliberate. Data lineage makes it possible for security analysts to see this pattern of behavior by connecting events together, which makes it easier to investigate an alert.
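The investigation described above can be sketched as a lookup of what the same user did with the same data shortly after the alert. This is only an illustration: the event dictionaries, the action names (`upload_blocked`, `encrypt_zip`, `attach`), the one-hour window, and the `data_id` that ties events to one lineage are all assumptions, not fields from a real product.

```python
from datetime import datetime, timedelta

# Hypothetical actions that suggest a workaround after a blocked upload.
EVASIVE_ACTIONS = {"encrypt_zip", "attach"}


def followup_events(events, alert, window=timedelta(hours=1)):
    """Events by the same user on the same data after the alert, oldest first."""
    end = alert["time"] + window
    later = [
        e for e in events
        if e["user"] == alert["user"]
        and e["data_id"] == alert["data_id"]
        and alert["time"] < e["time"] <= end
    ]
    return sorted(later, key=lambda e: e["time"])


def looks_deliberate(events, alert):
    """True if any follow-up event is one of the assumed evasive actions."""
    return any(e["action"] in EVASIVE_ACTIONS for e in followup_events(events, alert))


# Hypothetical sample data mirroring the Dropbox and WeChat example.
t0 = datetime(2024, 3, 1, 10, 0)
alert = {"time": t0, "user": "jdoe", "data_id": "design-42",
         "action": "upload_blocked", "target": "dropbox.com"}
events = [
    alert,
    {"time": t0 + timedelta(minutes=9), "user": "jdoe", "data_id": "design-42",
     "action": "attach", "target": "wechat"},
    {"time": t0 + timedelta(minutes=5), "user": "jdoe", "data_id": "design-42",
     "action": "encrypt_zip", "target": "laptop"},
    {"time": t0 + timedelta(minutes=7), "user": "asmith", "data_id": "memo-7",
     "action": "attach", "target": "gmail"},
]
deliberate = looks_deliberate(events, alert)
```

A fixed list of "evasive" actions is a simplification; in practice an analyst would review the full sequence rather than rely on a single flag.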