9/1/2025

Why Legacy Data Loss Prevention (DLP) Fails: Insights from Cyberhaven's VP of Sales Engineering, Jon Loya


Confronted with a rise in sensitive data breaches, businesses are under pressure to efficiently protect their information while overcoming myriad technical limitations. In a recent video, Jon Loya, VP of Sales Engineering at Cyberhaven, shared valuable insights on the challenges of data loss prevention (DLP) and introduced Cyberhaven's cutting-edge strategies for tracking sensitive data within organizations. For those looking to unlock the secrets of effective DLP, this post breaks down Loya's key points from the video.

Methods for Tracking Sensitive Data

Loya first explains the three basic legacy methods for tracking sensitive information: data maps, tags, and labels. These approaches can be assessed for both their breadth (the range of file types they can classify) and their persistence (how well they maintain those classifications when files are duplicated, transferred, or shared). Loya shows that each of the three legacy approaches has its limitations. He then explains how lineage—an innovative approach championed by Cyberhaven—overcomes many of the challenges faced by legacy methods.

1. Data Maps

Loya describes a data map as "a treasure map of where your sensitive information is found." Files are scanned for sensitive content, such as PCI, PII, or other regulated data, and a visual map is generated that shows which files contain which types of sensitive data. 

Loya argues that data maps have "pretty good breadth" in terms of what sorts of files they can classify as containing sensitive data. Their shortcoming is their almost complete lack of persistence: any changes to the location of sensitive data in the system—such as when a sensitive file is duplicated—won't appear on the map until the following scan, and these scans are often scheduled only quarterly or even yearly. This can leave companies in the dark about where exactly sensitive content is located in a file system.
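
To make the idea concrete, here is a minimal sketch of a scheduled scan building such a map. The detector patterns, file types, and paths are hypothetical illustrations, not Cyberhaven's implementation.

```python
import re
from pathlib import Path

# Hypothetical detectors; a real scanner would use far more robust checks.
DETECTORS = {
    "SSN (PII)": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Credit card (PCI)": re.compile(r"\b(?:\d[ -]?){15,16}\b"),
}

def build_data_map(root: str) -> dict[str, list[str]]:
    """Scan text files under `root` and record which sensitive data types each contains."""
    data_map: dict[str, list[str]] = {}
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(errors="ignore")
        hits = [label for label, pattern in DETECTORS.items() if pattern.search(text)]
        if hits:
            data_map[str(path)] = hits
    return data_map

# The map is only as fresh as the last scan: files copied or moved afterward
# won't show up until the next scheduled run.
print(build_data_map("/shared/finance"))
```

Because the map is rebuilt only when the scan runs, any copy or move that happens in between goes unrecorded, which is exactly the persistence gap Loya describes.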

2. Tags

Tags offer a more persistent alternative to data mapping, with greater breadth as well. As Loya points out, tags can be used to flag nearly any type of file as containing sensitive data (so, good breadth), and the tag on the file will persist when the file is duplicated within the system (so, greater persistence than with data maps). 

Still, their persistence only goes so far: Tags don't transfer across different file systems, precisely because a tag is a feature of the system in which it was created rather than part of the content of the file itself. This makes tags an incomplete solution for tracking sensitive content.

3. Labels

The third approach is labels, which increase persistence by embedding meaningful content identifiers directly into sensitive documents. Because the label is part of the file, not the system, it persists across both simple document transfers and different file systems, ensuring sensitive data remains marked, regardless of where it lands (good persistence).

Unfortunately, labels have only limited breadth because they are "only supported within documents that support document headers," typically office documents. Labels won't work in other file types that may contain sensitive content, such as text files, source code, images, and proprietary file formats. Again, like tags, labels offer only a partial solution to legacy tracking problems. 

4. Lineage: Cyberhaven's Innovative Solution

Cyberhaven addresses the limitations of legacy approaches and takes tracking to new heights with the concept of data lineage. Unlike the methods above, lineage considers the broader context in which files containing sensitive information are modified, copied, moved, or shared, and keeps track of the evolving relationships between original and derivative files. This sensitivity to the tracking history allows the lineage classification to persist even across different file systems, as the classification remains associated with all modification and movement events that occurred before.

Lineage thus eliminates the persistence issues found in tags and data maps. It also avoids the limitations on breadth encountered by labels, as lineage classifications can be maintained for any file type.
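
As a rough conceptual sketch (not Cyberhaven's actual data model), lineage can be pictured as a file plus the chain of events that produced it, so any derivative inherits its ancestor's classification:

```python
from dataclasses import dataclass, field

@dataclass
class FileNode:
    """A file plus the chain of events that produced it (conceptual sketch only)."""
    path: str
    origin: str                                 # e.g. "Salesforce export"
    events: list[str] = field(default_factory=list)
    parent: "FileNode | None" = None

    def derive(self, new_path: str, event: str) -> "FileNode":
        """Record a copy/move/share; the derivative inherits the full history."""
        return FileNode(path=new_path, origin=self.origin,
                        events=self.events + [event], parent=self)

    def is_sensitive(self) -> bool:
        # Classification follows the lineage, not the file's current location or format.
        return self.origin == "Salesforce export"

report = FileNode(path="/crm/accounts.csv", origin="Salesforce export")
usb_copy = report.derive("D:/usb/accounts_copy.csv", "copied to USB drive")
print(usb_copy.is_sensitive())  # True: the history travels with the derivative
```

Because nothing here depends on a document header, a tag database, or a fresh scan, the classification survives renames, format changes, and moves between systems.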

For organizations seeking comprehensive protection, Cyberhaven's data lineage approach offers the persistence and breadth that traditional methods often lack. 

Techniques for Content Inspection

Loya next explains the four approaches to content inspection by which sensitive data is identified: keywords, Regular Expression (RegEx), Exact Data Matching (EDM), and Optical Character Recognition (OCR).

1. Keywords

This approach draws upon a dictionary of terms and phrases (keywords) that indicate sensitive data. So, for example, files whose content contains the title of a particular top-secret project, say "Project X" (the keyword), can be flagged as sensitive.
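
A minimal sketch of keyword-based flagging, using a hypothetical keyword list:

```python
# Hypothetical keyword dictionary; real deployments maintain much larger lists.
KEYWORDS = {"project x", "confidential", "internal use only"}

def flag_by_keyword(text: str) -> set[str]:
    """Return every dictionary keyword found in the text (case-insensitive)."""
    lowered = text.lower()
    return {kw for kw in KEYWORDS if kw in lowered}

print(flag_by_keyword("Draft roadmap for Project X - do not distribute"))
# {'project x'}
```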

There is a limitation to relying on keywords for content inspection: not all sensitive data can be represented by one or more unvarying keywords. Social Security numbers, for example, have a recognizable form, but the content of that form (i.e., the string of numbers themselves) is highly variable and resists static classification.

2. RegEx

RegEx (for "regular expressions") solves the above problem by identifying sensitive content based on its format rather than content per se. To return to the example of Social Security numbers, their XXX-XX-XXXX pattern of nine digits can generally be recognized. The content inspection software can be programmed to look for this pattern and flag it as sensitive data. 

The challenge is that such patterns are not always fixed. In some files, for example, the hyphens between the parts of a Social Security number might be replaced by spaces or omitted entirely, or there may be other adjacent digits, such as GDPR indicators, that would prevent RegEx from identifying the string as containing a Social Security number. On the other hand, if we try to remedy the situation by, say, broadening our RegEx to encompass any expression containing a sequence of nine numerical digits, we risk incorrectly flagging the occasional series of nine random numbers—leading to an increase in false positives.
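
A small sketch of that trade-off, using illustrative (not production-grade) patterns:

```python
import re

strict = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")          # XXX-XX-XXXX, hyphens required
broad = re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b")   # hyphens, spaces, or nothing

samples = [
    "SSN on file: 888-21-5544",   # a real SSN, caught by both patterns
    "SSN on file: 888 21 5544",   # missed by the strict pattern
    "Order #888215544 shipped",   # nine digits that are not an SSN at all
]
for s in samples:
    print(f"strict={bool(strict.search(s))}  broad={bool(broad.search(s))}  {s}")

# The broad pattern catches the second sample but also flags the third:
# widening the RegEx trades false negatives for false positives.
```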

3. EDM

Exact Data Matching is meant to mitigate this problem of false positives. Instead of looking for matching patterns as RegEx does, EDM involves searching the target file for specific, pre-identified sensitive data (e.g., data that matches a set of particular Social Security numbers located in a reference database such as a company's employee files). 

Of course, EDM's ability to identify sensitive data in this way depends on the accuracy of the reference dataset. Keeping that dataset current requires regular updates, which adds maintenance overhead.
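
A simplified sketch of the idea follows; the reference values are hypothetical, and real EDM systems typically pre-index hashed values at much larger scale:

```python
import hashlib
import re

# Hypothetical reference values, e.g. SSNs pulled from an HR system. In practice
# they would be hashed and indexed ahead of time rather than kept in plain text.
REFERENCE_SSNS = {"888-21-5544", "123-45-6789"}
REFERENCE_HASHES = {hashlib.sha256(s.encode()).hexdigest() for s in REFERENCE_SSNS}

SSN_SHAPE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def edm_matches(text: str) -> list[str]:
    """Flag only candidates whose hash appears in the reference set."""
    return [c for c in SSN_SHAPE.findall(text)
            if hashlib.sha256(c.encode()).hexdigest() in REFERENCE_HASHES]

print(edm_matches("Employee 888-21-5544, visitor badge 999-99-9999"))
# ['888-21-5544'] - the second value has the right shape but isn't in the reference data
```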

4. OCR

An important limitation of the three preceding methods of content inspection is that they only work with text-based files. These methods cannot directly inspect for sensitive data that happens to be embedded within images, PDF files, and the like. This is where Optical Character Recognition (OCR) comes in: OCR can extract embedded text from images, allowing the text to then be inspected for sensitive data using one of the preceding methods.

However, OCR's ability to tackle image files comes at a cost: high resource consumption in the form of time, memory, and processing power. Moreover, the rate of false positives remains relatively high.
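
As a rough illustration (assuming the open-source Pillow and pytesseract packages plus a local Tesseract install, not Cyberhaven's pipeline), OCR simply feeds extracted text into an ordinary check:

```python
import re

from PIL import Image
import pytesseract

SSN_SHAPE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def inspect_image(path: str) -> list[str]:
    """Extract text from an image, then run an ordinary pattern check on it."""
    text = pytesseract.image_to_string(Image.open(path))  # the expensive OCR step
    return SSN_SHAPE.findall(text)

print(inspect_image("scanned_w2.png"))  # hypothetical scanned document
```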

The Problems of Obfuscation & Non-Text File Types

In addition to the limitations already mentioned, all four of the above content inspection techniques come up short when facing either of two challenges: obfuscation and non-text file types. 

Obfuscation occurs when data is intentionally formatted so that it evades identification as sensitive. For example, whereas a regular expression (RegEx) will readily recognize a value like 888-21-5544 as a Social Security number, correct identification can be thwarted simply by replacing the hyphens with characters such as x's (888xx21xx5544). 
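
A quick sketch shows how a small formatting change defeats a pattern check, and why the obvious countermeasure widens the net:

```python
import re

SSN_STRICT = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

obfuscated = "888xx21xx5544"
print(bool(SSN_STRICT.search(obfuscated)))        # False: the pattern is defeated

# One partial countermeasure is to strip separators before matching. That recovers
# the digits here, but such broad normalization also inflates false positives.
digits_only = re.sub(r"\D", "", obfuscated)
print(bool(re.fullmatch(r"\d{9}", digits_only)))  # True: nine digits remain
```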

Likewise, the traditional methods of content inspection struggle with non-text file formats, such as CAD designs, audio files, or videos. 

Problem Case: Endpoint Content Inspection

One particularly thorny case that illustrates several of the key challenges faced by legacy approaches to content inspection is endpoint operations. Loya acknowledges that "nobody wants another endpoint agent." Still, he points out that organizations need endpoint monitoring for activities such as transferring data to USB drives, printing operations, or workflows that extend beyond corporate networks (e.g., an employee on their home network uploading data to the cloud). 

Most legacy DLPs must conduct real-time inspections of these endpoint operations. Doing so, Loya notes, is more complicated than many people realize, involving four steps: 

  • extraction of the data, 
  • loading the extracted strings into memory,
  • processing the data (using keywords, RegEx, and/or EDM, run in parallel or serially), and
  • parsing the results.

All this can require a significant amount of endpoint CPU and memory, which must be traded against time: faster real-time processing demands more resources, which can in turn degrade performance on other tasks. 
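
To make the cost concrete, here is a toy sketch of those four steps running serially on an endpoint (the file path and checks are hypothetical); every step consumes CPU and memory while the user waits:

```python
import re
import time

KEYWORDS = {"project x", "confidential"}
SSN_SHAPE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def inspect_on_endpoint(path: str) -> dict:
    """Illustrative real-time check; every stage costs endpoint CPU and memory."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        raw = f.read()                         # 1. extract the data
    text = raw.decode(errors="ignore")         # 2. load the extracted strings into memory
    hits = {                                   # 3. process (keywords and RegEx, run serially here)
        "keywords": [kw for kw in KEYWORDS if kw in text.lower()],
        "ssn_like": SSN_SHAPE.findall(text),
    }
    blocked = any(hits.values())               # 4. parse the results into an allow/block decision
    return {"blocked": blocked, "hits": hits, "elapsed_s": time.perf_counter() - start}

print(inspect_on_endpoint("quarterly_report.txt"))  # hypothetical file headed for a USB drive
```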

Improved Performance with Cyberhaven's Lineage Solutions

Loya explains several ways Cyberhaven can effectively address performance challenges, such as those exhibited with endpoint content inspection.

First, in some cases, the use of lineage tracking obviates the need for any real-time endpoint content inspection at all: If the file involved already has an identifiable lineage that would suggest it contains sensitive data—for example, if the file originated from a secure repository such as Salesforce—then it may be unnecessary to carry out a content inspection. Legacy methods that don't take the file's derivation into account will miss this opportunity to avoid a superfluous inspection.
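
In sketch form (the helper and origin names are hypothetical, not Cyberhaven's API), the short-circuit looks something like this:

```python
# Origins whose lineage already implies sensitive content (hypothetical labels).
SENSITIVE_ORIGINS = {"salesforce", "hr_system", "source_repo"}

def needs_content_inspection(file_origin: str) -> bool:
    """Skip inspection entirely when lineage already marks the file as sensitive."""
    if file_origin in SENSITIVE_ORIGINS:
        # Enforce policy based on lineage alone; no CPU spent scanning content.
        return False
    return True

origin = "salesforce"  # e.g. the file's lineage traces back to a Salesforce export
if not needs_content_inspection(origin):
    print("Lineage says sensitive: enforce policy without inspecting content")
```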

If content inspection is still desired, Loya says it can be moved earlier in the chain, to the point when the file is first downloaded into the company's system. Not only does this allow the inspection to occur before the data is put at risk by an endpoint operation, but it also allows more time for the inspection, which means less drain on memory and CPU.

Finally, another resource-saving measure is to move content inspection from the endpoint operation to the cloud, where Cyberhaven can "crank up" the memory and CPU, completing the inspection more quickly and without adding to the processing load at the endpoint. This cloud-based approach also accommodates additional advanced processes, such as AI-based sentiment analysis of the file's content or the user's intent.

Conclusion: The Evolution of DLP

Jon Loya's insights underscore the importance of evolving DLP techniques. Whether you're starting with data maps or tags—or ready to adopt cutting-edge solutions like Cyberhaven's lineage technology—protecting sensitive information requires a multi-layered approach combined with ongoing maintenance and vigilance.

By understanding these tools and strategies, your organization can navigate the complexities of DLP and build a stronger foundation for data security. To learn more about Cyberhaven's solutions, visit Cyberhaven's website and explore how they're revolutionizing the future of data protection.