Data tagging overview
Data tagging is an old concept with a multitude of use cases ranging from data classification to data leak prevention. A few of the most important goals are:
- track who accesses sensitive data such as IP and trade secrets
- ensure that sensitive data is not exposed to outsiders or exfiltrated by employees
- discover and know where your private or consumer data is located
- meet compliance requirements for reporting who is accessing data both internally and externally (especially for privacy regulations like GDPR, CCPA, HIPAA, etc.)
- better understand business processes and gain insights into how to optimize them, and ensure that data is properly versioned and backed up, with an adequate disaster recovery plan to restore it
In the risk assessments that we do at Cyberhaven, we routinely discover that what business managers and security teams know about the locations of sensitive data is incomplete and inaccurate. For example, the data may at some point be stored in a designated folder on Box, but sooner or later it gets moved, copied, and shared to many other places within and outside the enterprise, like pollen during the allergy season.
Nearly all data protection plans will look at egress locations (e.g., email, removable media, personal cloud storage, etc.) in order to determine if the exfiltrated data reaching those locations is sensitive or not. But how can you tell if this data is sensitive or not at the egress point? Data tagging could be an answer, in principle …
Today’s data tagging is opaque. It usually starts with a discovery phase where a tool crawls through various data repositories deemed to be sensitive and tags the data it finds. Then the security team, together with the data owners, sets up policies to recognize these tags on egress. The data owner must assess the risk of data being shared outside the enterprise. This resembles an industrial process where each component of a car is implanted with an RFID tag, and the robots on the assembly line can then track and recognize each part, process it, and pass it on until a full car emerges. Sounds great in theory, so what’s wrong with this approach?
- the way data is handled in a modern enterprise is chaotic and tools need to adapt
- tagging only works with some types of files (e.g., Word docs), so it covers only a small fraction of the ways data is exchanged nowadays (e.g., social media and messaging apps like WhatsApp)
- opaque tags can be easily removed when data is processed by users or shared with third parties
- it takes a long time to tag data initially during the discovery phase, and afterward the onus is on users to tag new data manually
In a modern enterprise, data tagging is usually lagging behind the actual business practices and always playing catch-up.
A bit of history
Data classification tools often tag known file formats like Office and PDF using the file metadata properties available for these file formats. The goal is typically to keep track of this data as it is being used by employees, updated, etc. It’s a basic tool to try to keep things organized, mainly with the end goal of proving compliance for some of the tagged data (e.g., PCI).
DLP went one step further and used the same tagging mechanism as an alternative to content inspection, which has proven overly noisy and slow. DLP typically scans files at various egress points, e.g., when an attachment is emailed. Thus, instead of scanning the content to look for patterns, DLP used the metadata embedded in the files during the discovery process to identify their classification.
Needless to say, relying on file metadata has poor coverage because it only applies to a few file formats. Moreover, it is more of a hack than a principled implementation: metadata fields are not standardized, so there are many formats, and the tags can be cleared by various apps, which can lead to a high number of false negatives. At the same time, a high number of false positives burdens security teams already drowning in alerts.
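To make the metadata approach concrete, here is a minimal sketch in Python. It assumes a simplified stand-in for an Office file: a zip package with a custom properties part at docProps/custom.xml, not a complete Word document. The property name "Classification", the value "Confidential", and all function names are hypothetical choices for illustration and do not describe any particular DLP product.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional

# Namespaces used by Office Open XML custom document properties.
CP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
VT_NS = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"
CUSTOM_PROPS = "docProps/custom.xml"
TAG_NAME = "Classification"  # hypothetical property name chosen for this sketch


def make_package(body: str, tag: Optional[str] = None) -> bytes:
    """Build a tiny Office-style zip package, optionally carrying a tag."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", body)
        if tag is not None:
            root = ET.Element(f"{{{CP_NS}}}Properties")
            prop = ET.SubElement(
                root, f"{{{CP_NS}}}property", {"name": TAG_NAME, "pid": "2"}
            )
            ET.SubElement(prop, f"{{{VT_NS}}}lpwstr").text = tag
            zf.writestr(CUSTOM_PROPS, ET.tostring(root))
    return buf.getvalue()


def read_tag(data: bytes) -> Optional[str]:
    """Return the classification tag stored in the metadata, if any."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None  # formats without this metadata part cannot carry the tag
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        if CUSTOM_PROPS not in zf.namelist():
            return None
        root = ET.fromstring(zf.read(CUSTOM_PROPS))
    for prop in root.iter(f"{{{CP_NS}}}property"):
        if prop.get("name") == TAG_NAME:
            value = prop.find(f"{{{VT_NS}}}lpwstr")
            return value.text if value is not None else None
    return None


def egress_allowed(data: bytes) -> bool:
    """Metadata-only egress check: block only when the tag says Confidential."""
    return read_tag(data) != "Confidential"


def export_as_text(package: bytes) -> bytes:
    """Simulate an app that extracts the body and drops all metadata."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.read("word/document.xml")


tagged = make_package("Q3 acquisition plan", tag="Confidential")
exported = export_as_text(tagged)
print(read_tag(tagged), egress_allowed(tagged))      # Confidential False
print(read_tag(exported), egress_allowed(exported))  # None True
```

The second line of output is the false negative described above: the content is identical, but once an application rewrites the file without the metadata part, a tag-based egress check has nothing left to recognize.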
To improve on this, some DLP engines moved to Alternate Data Streams (ADS), a feature of NTFS (NT File System, sometimes expanded as New Technology File System), the file system that the Windows NT operating system uses for storing and retrieving files on a hard disk. With ADS, the tags live in the file system, where they are easier to keep track of. ADS looks great in theory, but it is far from universal. Coverage is limited to a single NTFS instance. In addition, tags do not persist across network shares or when files are emailed from one user to another; the same applies to data that traverses cloud services, CRMs, etc., which is exactly the type of workflow that is the norm for most files nowadays.
Now imagine that the auto industry pipeline in the example above moves to computer vision algorithms instead of embedding an RFID tag into every part on the assembly line. The robots would simply see and recognize the parts based on their shape and would record all the parts they have seen. Data tracing uses a similar principle, which I will discuss below.
Technology like DRM (e.g., Microsoft Azure Information Protection (AIP)) is not the first thing one would bring up as data tracing technology because its main benefit is data encryption. However, it can also be used for data tracing: AIP keeps a record of when authorized users request the decryption key for a document, and this log of authorizations can, to some extent, be considered a trace of the data. Other approaches wrap a container around each piece of unstructured data. In order to view a PDF document, you then need a custom runtime, which makes it even harder for third parties to interact with the data.
Unfortunately, in practice, such technology was always hard to deploy and never became the norm because of its limited coverage. It faces the same challenge as Apple’s Messages app, which only works with other iPhones, so I can use it for only 5% of my contacts; for the rest, I have to use WhatsApp.
Cyberhaven’s data tracing technology
I used an industrial pipeline analogy to explain data tagging and traditional data tracing technologies because this is what these technologies expect: an ordered world where each workflow is carefully controlled and planned in advance. In my experience, this never holds for how data is handled in enterprises today. Instead, I have learned to expect a fully distributed system with ever-changing shape and behavior; it’s very similar to the way people interact with each other. Does this ring a bell?
These days, it brings to mind contact tracing technology: your phone acts like a sensor that records a trace of interactions with other people and alerts you when you have been in contact with people who later developed COVID-19 symptoms. This is the approach that we have the highest hopes for to keep us healthy.
Even though it was developed before the contact tracing technology that has become so important in 2020, the Cyberhaven technology has many similarities with contact tracing at a high level.
Cyberhaven uses sensors in places that maximize the coverage one can obtain by monitoring data behavior in real time, that is, in the places where users create, handle, and distribute the data: our endpoints and the SaaS platforms we use to handle the data and collaborate. This brings the following benefits:
- works for any piece of unstructured data
- high coverage of data repositories and data behavior (local files on endpoints, websites, email activity, corporate SaaS, egress destinations, and applications)
- does not rely on content analysis for tagging, but can enrich the trace with contextual information
- traces data from its origin up to any egress point
- provides the full trace of the data
- no need to set up tags or configure policies a priori
- reduces false positives
- opportunity to interact with users in real-time for just-in-time security training and prevent accidental disclosures
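The general idea behind tracing data from its origin to an egress point can be sketched as a small lineage graph. This is a minimal illustration under stated assumptions, not Cyberhaven's implementation: the Event fields, the TraceGraph class, and the location strings are all hypothetical. Each sensor observation is recorded as an edge, and an egress decision walks the edges backwards instead of inspecting content or looking for a tag.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Event:
    """One sensor observation: data moved from `src` to `dst` via `action`."""
    src: str
    dst: str
    action: str
    user: str


class TraceGraph:
    def __init__(self) -> None:
        # Maps each location to the events that delivered data into it.
        self._incoming: Dict[str, List[Event]] = defaultdict(list)

    def record(self, event: Event) -> None:
        self._incoming[event.dst].append(event)

    def trace(self, location: str) -> List[Event]:
        """Walk backwards from `location`, collecting every upstream event."""
        seen, stack, path = {location}, [location], []
        while stack:
            current = stack.pop()
            for event in self._incoming.get(current, []):
                path.append(event)
                if event.src not in seen:
                    seen.add(event.src)
                    stack.append(event.src)
        return path

    def origins(self, location: str) -> Set[str]:
        """Upstream locations that have no recorded source of their own."""
        upstream = {location} | {e.src for e in self.trace(location)}
        return {loc for loc in upstream if not self._incoming.get(loc)}


def is_sensitive_egress(graph: TraceGraph, dst: str, sensitive: Set[str]) -> bool:
    """Flag an egress point if any of its origins is a sensitive repository."""
    return bool(graph.origins(dst) & sensitive)


graph = TraceGraph()
graph.record(Event("box:/finance/forecast.xlsx", "laptop:Downloads/forecast.xlsx", "download", "alice"))
graph.record(Event("laptop:Downloads/forecast.xlsx", "laptop:Desktop/notes.txt", "copy-paste", "alice"))
graph.record(Event("laptop:Desktop/notes.txt", "gmail:personal", "upload", "alice"))
graph.record(Event("web:news.example.com", "laptop:Desktop/article.pdf", "download", "alice"))
graph.record(Event("laptop:Desktop/article.pdf", "dropbox:personal", "upload", "alice"))

sensitive_sources = {"box:/finance/forecast.xlsx"}
print(is_sensitive_egress(graph, "gmail:personal", sensitive_sources))    # True
print(is_sensitive_egress(graph, "dropbox:personal", sensitive_sources))  # False
print([e.action for e in graph.trace("gmail:personal")])
```

In this sketch the text pasted into notes.txt carries no tag and is in a different file format than the spreadsheet it came from, yet the upload to personal email is still attributed to the Box folder because the trace records how the data moved.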
See Data Tracing in action for your high-value data.