November 18, 2021

The Power of Graph Analytics to Protect Your Data

Cyberhaven has upended what the industry has come to expect from old-school DLP by introducing data protection that is far more reliable, easier to use, and can be applied consistently to any type of data or content. Unlike the traditional signature and tagging-based approaches that have dominated DLP for years, Cyberhaven introduces a novel approach to data protection that leverages graph analysis to let organizations see and control their data and risk in a new light.


In this three-part series, we take a deeper look at the technology and innovations that make this possible. In this blog, we will cover the basics of what graph analysis is, and how we can apply it to the flow of enterprise data to build new, smarter ways of protecting enterprise data. In the following blogs, we will dive into some of the unique innovations that allow Cyberhaven to analyze at a depth and speed that lets organizations protect their data in ways that were not previously possible.


Graph analysis is uniquely powerful for data protection because, instead of looking at a piece of data at a single point in time, it shows how data is created, shared, transformed, and consumed throughout its entire history. Rather than defining data only by its bytes, graph analysis of data flows lets us understand the value of data in its business context. For example, by tracking the flow of data we can see when a user downloads data from WorkDay, copies and pastes it into a spreadsheet, and then uploads that spreadsheet to Google Drive. Graph analysis can recognize that the same sensitive data is now in Google Drive or, alternatively, prevent it from being shared there in the first place.

And while graph analysis itself isn’t new, using it to protect enterprise data was new and introduced technical challenges of its own. This required the Cyberhaven team to develop entirely new technologies capable of taking any piece of data, understanding its complete enterprise history and context, and then acting on that context in real time.

With that in mind, let’s take a closer look at what makes graph analysis so powerful for data protection, how Cyberhaven is breaking new ground in graph analysis, and ultimately what it means for your security program.

Why Data Flow Graph Analysis for Data Protection?

Before getting into why graph analysis is so useful, it is important to know what graph analysis is. Mathematically speaking, a graph is a set of nodes connected by edges, where each edge represents a relationship between two nodes. A data flow graph shows how the various instances and derivatives of each piece of data are connected by user actions. For example, a highly simplified graph could look something like the figure below. In this graph, we can follow how a piece of data moves between multiple users and resources.

For example, let’s suppose Janice saves an unreleased earnings report from the company’s CFO folder to her team’s OneDrive. At this point, that content may be accessed by other users and systems and subsequently spread to other locations. It is then downloaded by John, who emails it to Rebecca, who then shares it again and saves it to her personal cloud drive. Now suppose you want to find where all the copies and derivatives of that data are in the enterprise. Data flow graph analysis can follow the data as Janice, John, and Rebecca move it around, traversing every branch of the graph to deliver the answer. Next, let’s say we want to ensure that no data from the CFO folder is uploaded to risky or unapproved locations such as a user’s personal cloud. Once again, graph analysis can walk all paths of the graph from the CFO folder, recognize when important data is about to be sent to a risky location, and take action. We can also see exactly how the user gained access to the file in the first place and take corrective action.
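The traversal described above can be sketched in a few lines of Python. This is only an illustrative model, not Cyberhaven's implementation: the edge list, the node names, and the `is_risky` rule are all hypothetical, and a breadth-first search stands in for whatever traversal a production system would use.

```python
from collections import deque

# Hypothetical data flow graph for the earnings-report example.
# Each edge is (source, user action, destination); all names are illustrative.
EDGES = [
    ("CFO folder/earnings.xlsx", "save", "Team OneDrive/earnings.xlsx"),
    ("Team OneDrive/earnings.xlsx", "download", "John's laptop/earnings.xlsx"),
    ("John's laptop/earnings.xlsx", "email", "Rebecca's inbox/earnings.xlsx"),
    ("Rebecca's inbox/earnings.xlsx", "upload", "Rebecca's personal cloud/earnings.xlsx"),
]
ORIGIN = "CFO folder/earnings.xlsx"


def build_graph(edges):
    """Turn the edge list into an adjacency map: node -> [(action, next node)]."""
    graph = {}
    for src, action, dst in edges:
        graph.setdefault(src, []).append((action, dst))
    return graph


def descendants(graph, origin):
    """Breadth-first walk from origin.

    Returns a dict mapping every copy or derivative reachable from origin
    to the path of nodes that led to it.
    """
    paths = {origin: [origin]}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for _action, nxt in graph.get(node, []):
            if nxt not in paths:  # visit each node once, even if the graph has cycles
                paths[nxt] = paths[node] + [nxt]
                queue.append(nxt)
    del paths[origin]  # the origin is not a copy of itself
    return paths


def is_risky(node):
    """Hypothetical policy: anything stored in a personal cloud is unapproved."""
    return "personal cloud" in node


copies = descendants(build_graph(EDGES), ORIGIN)
for location, path in copies.items():
    flag = "RISKY" if is_risky(location) else "ok"
    print(f"{flag:5} {location}  (via {' -> '.join(path)})")
```

Because each result carries the full path back to the CFO folder, the same walk answers both questions from the example: where every copy lives, and how the data reached a risky location.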

Even from this very simplified example, we can see that a data flow graph is a natural way to think about how data is shared, consumed, and modified. Organizations are largely defined by how they can leverage their most important data. How can internal teams safely collaborate on important projects? How can they work efficiently with their clients’ data while keeping this sensitive data safe? How does the organization leverage its intellectual property without putting it at risk? In almost every case, data needs to be used in order to have value, and the usage of that data is ideally defined by relationships. Where is the data from? Who is allowed to use it? How can it be shared and over what boundaries? Graphs are the natural, intuitive way of not only visualizing these relationships but also protecting and controlling them.

A Better Approach to Enterprise Data

The advantages of graph analysis become even clearer when compared to the old-school approaches of data security and DLP. Traditionally, DLP tools had two main strategies for identifying and controlling data – content-based signatures and content tagging. Both approaches have big challenges that graph analysis resolves.

Many DLP products try to apply signatures to identify sensitive content in much the same way that old-school antivirus and IPS products use signatures to find threats. However, there are a variety of problems with this approach. Unlike an AV signature that is designed to detect a very specific threat, organizations cannot write a signature for every single document that they want to protect. Instead, they need their signatures to apply to an entire class of documents, for example, sensitive financial documents. These documents differ enough that it is almost impossible to define signatures that will apply to every file. If signatures are too specific, they will fail to detect sensitive information (false negatives), and if they are too broad, they will incorrectly flag non-sensitive content (false positives). Both outcomes are unacceptable.

This problem is compounded as organizations try to apply data protections to more diverse types of data. For example, it might be easy to write a signature for a credit card number, yet almost impossible for other critical data such as internal product plans, financial projections, or design files, just to name a few. The problem gets even more complicated as users edit and modify content, making it harder for signatures to keep up. Ultimately, this means signatures are only able to protect a small minority of the most predictable, structured enterprise data, while everything else remains unprotected.
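To make the contrast concrete, here is a minimal sketch of a content signature for credit card numbers. It is a simplified illustration rather than how any particular DLP product works: the pattern only covers 16-digit numbers, and the function names are hypothetical. A regular expression finds candidates and the Luhn checksum filters out random digit runs.

```python
import re

# Simplified signature: 16 digits in groups of four, optionally separated
# by a space or hyphen. Real card formats vary more than this.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in candidate pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def contains_card_number(text: str) -> bool:
    """Flag text that contains at least one Luhn-valid 16-digit number."""
    return any(luhn_valid(m.group()) for m in CARD_PATTERN.finditer(text))


print(contains_card_number("Payment card: 4111 1111 1111 1111"))
print(contains_card_number("Q3 roadmap: ship the redesign before the partner summit"))
```

The signature works for the card number because that data has a fixed, checkable structure. The product plan in the second string is just as sensitive, but it has no structure for a pattern to match, which is exactly the gap described above.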

Tagging allows security teams to add specific identifiers or tags to sensitive assets. Much like a detective placing a GPS tracker on a suspect’s car, security staff or the end-users themselves can apply tags to content to track and control where they go. But much like the GPS tracker, tagging only protects data that security knows about and preemptively applies tags to. Other copies of the same data may not be tagged and controlled. Additionally, end-users may make mistakes when applying tags or forget to apply them altogether. Users may inadvertently or intentionally remove tags to prevent the data from being tracked. Data can be converted into file formats that don’t support tags such as CSV files. Ultimately, tagging requires significant work, is applied to content by exception, and is prone to being disabled.

Data flow graph analysis changes all of this. Unlike the GPS tracking metaphor above, Cyberhaven’s graph analytics can be thought of as an omniscient system of cameras that tracks and correlates every movement within a city. Now instead of just tracking a single car, we can track everything the suspect does. If the suspect changes his appearance or changes cars, it doesn’t matter.

Graph analysis brings this sort of omniscience to data security. Security teams can see the full story behind any and every piece of data and all its derivatives in the enterprise without having to do any upfront tagging or signature development. The graph builds the context automatically. And while the camera analogy above can sound a bit like Big Brother, in practice it is far less invasive than signatures or tagging. Graph analysis can be applied passively to any data, even without inspecting the content itself. Likewise, since graph analysis knows where data is coming from, we can selectively inspect data from a corporate source while ignoring data coming from an employee’s personal Facebook page. This can be extremely powerful in that it allows organizations to apply strong controls without having to expose sensitive content for analysis or run into potential data privacy issues.
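As a rough sketch of what origin-based selective inspection could look like, the Python below walks an item's provenance back to its root sources and only marks it for inspection when one of those roots is corporate. Everything here is assumed for illustration: the `PARENTS` map, the item names, and the `CORPORATE_PREFIXES` list are hypothetical and do not describe Cyberhaven's data model.

```python
# Hypothetical provenance map: each item points to the items it was derived from.
PARENTS = {
    "upload:salaries.xlsx": ["laptop:salaries.xlsx"],
    "laptop:salaries.xlsx": ["workday:export/salaries.csv"],
    "upload:vacation.jpg": ["laptop:vacation.jpg"],
    "laptop:vacation.jpg": ["facebook.com:photos/vacation.jpg"],
}

# Hypothetical list of sources the organization treats as corporate.
CORPORATE_PREFIXES = ("workday:", "cfo-folder:")


def origins(item, parents):
    """Walk backwards through the provenance map and return the root sources."""
    seen, stack, roots = set(), [item], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        preds = parents.get(node, [])
        if not preds:  # nothing upstream, so this node is an original source
            roots.add(node)
        stack.extend(preds)
    return roots


def should_inspect(item, parents=PARENTS):
    """Inspect content only when it traces back to a corporate source."""
    return any(root.startswith(CORPORATE_PREFIXES) for root in origins(item, parents))


for item in ("upload:salaries.xlsx", "upload:vacation.jpg"):
    print(item, "inspect" if should_inspect(item) else "ignore")
```

Note that the decision is made from the graph alone: the personal photo is skipped without its content ever being opened.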

Hopefully, this provides a solid background on why data flow graph analysis is so transformative for the protection of enterprise data. In the next blog, we will begin to dive into the “how” of the Cyberhaven approach and the need to build incredibly rich graphs and to analyze across many vectors in real time.
