Introduction and Agenda Overview
Cameron Galbrath : Hi everyone. Thanks for joining us today for our webinar. My name is Cameron Galbrath, and I'm the Senior Director of Product Marketing here at Cyberhaven. Today we're gonna talk about why insider risk has fundamentally changed, not just incrementally, but structurally. So for our agenda today, we'll define insider risk and data protection in this new era.
I'll then share some research from our Cyberhaven Labs team about the broken perimeter and data fragmentation, and what that means for organizations trying to secure their data. I'll then walk through an example of why data fragmentation breaks these traditional controls on data. And finally, I'll share a little bit about our solution of combining data lineage and AI.
Defining Insider Risk and Data Protection
Cameron Galbrath : So for our purposes, when we talk about insider risk and the goal of insider risk, it's really to secure the most important data that an organization has. Now, of course, there are other aspects of insider risk, such as productivity monitoring, that can be important to organizations, but they're rarely as existentially important as making sure that a company's most important information doesn't get compromised.
Traditional Data Security Methods
Cameron Galbrath : So how has the industry tried to secure data before? What tactics and strategies have we taken? If you think about it, there are really two ways that someone can get to data that really matters: systems and people. And systems and people have always formed this feedback loop around data, where more usage creates more data, which impacts the systems. So these two are very tightly intertwined as data moves between the two sides. And on the systems side, traditionally we would protect the ways that we would get to data, or where that data would live. So we'd protect the cloud, where a lot of data lives. We'd protect the network that we might use to access that data, and we'd protect the device that's accessing the data from viruses, malware, and things like that.
And then on the people side, that's where we would think about trying to change or influence behavior. So we try to do training to prevent mistakes or prevent compromise, do phishing tests, and things like that to make the employee base aware. And then we might have some controls in place that would try to stop malicious actions, like exfiltrating certain sensitive data.
And the reality is that most of the investment and the energy over the years has been on the systems side, right? So there are all kinds of great cloud security solutions, network security solutions, EDR, and others like that. And while security is always a matter of probabilities, that side of the feedback loop is pretty well protected. That means the people element, the human side of things, is where so much of the risk has shifted, because that's, in relative terms, the easier side to compromise.
That's part of why insider risk has become such a hot topic for so many organizations, and it's happening at a time when there are a lot of fundamental changes in the landscape.
The Shift to Fragmented and Derivative Data
Cameron Galbrath : So to really get into it, and to complicate things: in the simpler world of data security, there were some assumptions around data and how we could protect it.
So that included assumptions like: data, even if it's unstructured and not in a data store or a structured database, is still gonna live in files. Those files are gonna have tags and labels and metadata. Those files will live in systems. And those systems will define what we trust.
And it was a fairly neat system. It wasn't always perfect, but it was certainly easy, and the thinking was that you could use some of those tags, labels, and categorizations downstream, so to speak, when you're trying to protect that data. But now we've had a shift, where we're seeing a rise in fragmented and derivative data that is still business critical but easily evades controls.
Research Findings on Data Exfiltration
Cameron Galbrath : We'll talk about what that looks like in just a moment, but to get to the punchline: our team here at Cyberhaven Labs, the research arm of our company, did some research on critical data. This is the stuff that's extremely important to the organization: strategic plans, customer data, future patents, other intellectual property, and trade secrets. They found that when that kind of critical data was exfiltrated by employees, over 80% of that exfiltration came in the form of fragmented or derivative data. And that means that most data loss no longer looks like somebody downloading a single, concrete sensitive file.
It actually looks like normal work. Now, to make matters even more complicated, we also found that, on average, sensitive data is copied six times before it leaves an organization. Furthermore, over 60% of insider risk incidents involve fragments of data that legacy DLP solutions, what we've traditionally used to protect data, would never flag.
And fewer than 20% of organizations in a market survey were able to trace the full path of their sensitive data end to end. So this is not just a tuning problem, a matter of adjusting the accuracy to get more or fewer false positives or false negatives. This is really a visibility problem.
Challenges with Traditional Data Protection
Cameron Galbrath : And speaking of visibility, part of the challenge here is that traditional tools will see static snapshots of that data, but data today is constantly in motion, and that's where the risk is.
So how do we get here to this new landscape and this new situation where data is getting harder and harder to protect?
Understanding Data Lineage
Cameron Galbrath : And what exactly do we mean by fragmented and derivative data? Let's take an example of how data might move through an organization and what controls we might have on it.
So consider a common workflow, and let's start with the end result. If we care about insider risk because we're trying to protect sensitive data from getting out, then we might have some controls in place at the endpoint, where we look at a user and what they're doing.
And in this case, that might be a user who's uploading a file to a third party or an external shared drive, Dropbox in this case. And with a combination of a DLP solution and an insider risk solution, one of those legacy tools might be able to see that particular file. In this case, we'll call our user Bill.
The file that Bill uploaded is called "holiday cards." Now, is this a risky or sensitive movement of data? It's actually hard to tell unless we have a lot more information, because yeah, this could be an innocent movement of data, or it could be hiding something more. If we could rewind the clock and see what happened, see where this data came from, who interacted with it, and how it's been modified over time, then we would have a lot more context.
And in this case, we would see that a lot more people besides Bill interacted with this data: copied it, modified it, created derivatives of it, moved it between devices, shared it with multiple people, uploaded it to a cloud SaaS application, downloaded it from another one, further modified it. So there are all these activities, all this history of what happened to that data from its original source to when we actually saw Bill trying to do something with it. Knowing that, we can do a lot more to understand Bill's intent. Now, part of what's going on here is that there have been a lot of modifications of this data, so derivatives of it are being created, and fragments of the data are being created as well.
So you can imagine, say, one of the sheets in the entire spreadsheet file getting shared out with somebody, or something being copied and pasted from that spreadsheet into Slack. And that gets to my next point, which is how the usage of data has changed, particularly over the last three to four years or so.
Really, with the rise of collaboration tools, and especially accelerated by generative AI, the way that data moves within an organization is a lot more fragmented, and it's moving a lot faster across a lot more different paths.
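To make the lineage idea concrete, here's a minimal sketch, not Cyberhaven's actual implementation, of what tracing a file's history backwards from an upload might look like. The event log, actors, and file names are all hypothetical, invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEvent:
    actor: str   # who touched the data
    action: str  # e.g. "derive", "copy", "upload"
    source: str  # where the data came from
    dest: str    # where it ended up


# Hypothetical event log; the schema is illustrative only.
events = [
    LineageEvent("alice", "derive", "Q3-financials.xlsx", "summary.xlsx"),
    LineageEvent("bob",   "copy",   "summary.xlsx",       "holiday_cards.xlsx"),
    LineageEvent("bill",  "upload", "holiday_cards.xlsx", "dropbox.com"),
]


def trace_back(dest: str, events: list[LineageEvent]) -> list[LineageEvent]:
    """Walk the event log backwards from a destination to its origins."""
    chain = []
    frontier = {dest}
    for ev in reversed(events):
        if ev.dest in frontier:
            chain.append(ev)
            frontier.add(ev.source)
    return list(reversed(chain))


for ev in trace_back("dropbox.com", events):
    print(f"{ev.actor}: {ev.action} {ev.source} -> {ev.dest}")
```

Tracing back from Bill's upload surfaces the whole chain: the innocuously named "holiday cards" file is a copy of a derivative of a financials spreadsheet, which is exactly the context a snapshot of the final upload can't provide.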
Examples of Data Fragmentation
Cameron Galbrath : So let's take an example here. You might be putting together some information on one of your customers, and you might take that information, or some piece of information on that customer, from a system of record.
So that could be Salesforce in this example, and you're gonna paste that snippet of data into a document. That's not moving in a file. There's no metadata to record, there's no label to attach to it, but it's something that's pretty sensitive, and it's going into a document. Then, similarly, we might take a support system that has a number of open tickets for that account and put that into the document as well, and then there'd be some spreadsheets.
With customer information, maybe other data that we've enriched about that customer and other customers, and that's gonna be referenced in that document because they can be linked. And then, when we start adding AI to this mix, you can have things like Slack bots or copilots or other agents operating in the environment that are able to access that data, understand it, and share it with others.
And the result is that you've got these pieces of data that are moving around through the organization without any of the tags, any of the labels, any of the metadata, because they're not moving in files themselves. So this actually becomes very hard to track, and this is the core of this fragmented and derivative data problem.
How do you protect data without so many of the traditional tools in the toolkit that we've historically relied on? Of course, one of the questions that comes up is: okay, what if we did smarter content inspection? In other words, a common reaction is to say: if we don't know enough about this data because it's moving in snippets, what if we did a better job of inspecting it? And with AI, ah, we can bring a ton of intelligence to bear to analyze this data, going beyond regular expressions and the other techniques we've used that rely on lots of data and recognizable patterns.
Instead we say: hey, let's use AI to classify data more accurately, so we can understand the meaning there, not just keywords or labels. And there's a lot of value to that, and that instinct is understandable, but the fact is that even though it's quite useful, it's still an incomplete solution to this problem, because the content itself is only part of the puzzle.
So thinking back to the lineage diagram, all that other information about the data and where it came from, even if you've got perfect content inspection, that still doesn't give you enough to answer some of the most important questions. Like, where did this data come from? How did it change? Why is it moving now?
Because classification is only gonna see the current snapshot; it doesn't have a memory of where that data came from. So going back to the example we outlined here in the dotted lines: there is so much more that happened to this data, and so many more people and systems that interacted with it, right?
Again, here it was copied, pasted, downloaded, uploaded, emailed around, even in this very simple example. And so data lineage, something that gives you the history of the data, shifts the question from "what is the content?" to "what happened to this data? What is the context around this data?" It tracks data from creation through every transformation and movement, across people, apps, and systems.
In other words, it provides the vital context that matters when we're trying to identify risky behavior and reduce insider risk. So let's drill into a couple more examples of why this context matters when we're looking at sensitive content, some facets of content inspection and what you would look at with sensitive data.
So remember, content inspection, as you'd see it in longstanding DLP solutions, and probably some solutions that you've got in place in your environments, is typically going to rely on pattern recognition, right? They'll look for patterns in the data that indicate whether some data is sensitive or not. And a major pain that we've heard from our customers, and probably one that you've experienced with this approach, is that the patterns can often create false positives. In other words, there's a tuning challenge if you use pattern recognition to find things that would indicate sensitive data. Think about the pattern for a Social Security number here in the US: three digits, dash, two digits, dash, four digits.
In this example, an account ID or some other string of numbers with the same length or the same pattern could inadvertently show up as a Social Security number when in fact it's just an account ID. Now, that might be another kind of sensitive data, but it's not the sensitive data the system thinks it is, and it's leading the analysts and the data security team down the wrong path.
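That false-positive problem is easy to reproduce. A minimal sketch, using a made-up account ID that happens to share the SSN shape:

```python
import re

# SSN-shaped pattern: three digits, dash, two digits, dash, four digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

ssn        = "123-45-6789"  # an SSN-shaped value
account_id = "294-30-3121"  # hypothetical account ID with the same shape

# Content inspection alone flags both -- the false-positive problem.
print(bool(SSN_PATTERN.search(ssn)))         # True
print(bool(SSN_PATTERN.search(account_id)))  # True
```

The pattern can't distinguish the two strings; only something outside the content itself, such as where each value came from, can.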
Now, of course, the other big problem with relying on and looking just at the content is that sometimes we can't actually do that. In the example on the right, think about encrypted data. If a file is encrypted before it ever passes through some gate, meaning a DLP solution or something that's gonna do content inspection, there's a very limited number of ways you could ever understand what's in that file, if it's really encrypted well, and especially if there was malicious intent behind it, where somebody took actions to obfuscate it. So that's where data lineage comes in, to provide some really necessary context to this content. To take another example, similar to the lineage example, let's say you had a spreadsheet with some names and some email addresses.
Is that sensitive? If we look at the case on the left, it's hard to say, because we could identify that there are some email addresses, but are those PII, patient data in a healthcare example, or something more benign? Now, you and I looking at this could see that, oh, these are tip lines for media outlets.
This is something that somebody in a PR organization or a marketing team might have, and it's a handy list, but these are publicly available emails. It's not really that sensitive. But just by looking at that data, without some knowledge of that, some context, it's hard to say. Now, if we combine the content and the context together, then we can accurately classify this.
And in this case, we can say: oh, this actually is not very sensitive data, because the lineage, the provenance of this data, where it came from, indicates that this is actually from public information, it's used by the marketing and communications teams, and, okay, this is not something that is existentially important to the business.
So I'll give you a final example in really simple terms. Think about if you, or one of your insider risk or DLP systems, saw this number: 294303. You know that it's a number. It'd be hard to tell more than that. Then, if there's a dollar sign on it, we can start inferring some things.
Okay, it's a financial metric, so maybe this is a customer's contract value, maybe it's something from our finance team. Still hard to say, but we know it's something that could be sensitive. Now, if we can bring more context into this, then we could see: ah, okay, $294,303 is the company's unreleased Q3 revenue number, because we'd know where this data came from and have a lot of context around how it was created, where it's going, and who's interacting with it.
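A toy sketch of that content-plus-context idea; the field names, source paths, and rules here are invented for illustration and don't reflect any real classifier:

```python
def classify(value: str, context: dict) -> str:
    """Hypothetical classifier: the content alone is ambiguous,
    so lineage context drives the verdict."""
    if not value.lstrip("$").replace(",", "").isdigit():
        return "unknown"
    source = context.get("source", "")
    if source == "public-website":
        return "not sensitive"
    if source == "finance/q3-close.xlsx" and not context.get("released", True):
        return "highly sensitive: unreleased financials"
    return "needs review"


# Bare number: content inspection alone can only say "it's a number".
print(classify("294303", {}))  # "needs review"

# With lineage context, the same number becomes clearly sensitive.
ctx = {"source": "finance/q3-close.xlsx", "released": False}
print(classify("$294,303", ctx))  # "highly sensitive: unreleased financials"
```

The point of the sketch: the same string gets two different verdicts depending entirely on provenance, which is information no amount of content inspection can recover.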
And yes, we use some of the labels and other traditional techniques too, but there's so much more information we can draw on to understand how sensitive that data really is. Now, that's fine for a simple example, but a lot of people ask: how does this actually work at scale?
And it is actually quite a difficult engineering challenge, because if you think about it, if you're tracing the flow of data through an organization, what does that mean? You're looking at potentially thousands or tens of thousands of users, each taking hundreds or thousands of actions every day.
And you're gonna want some historical understanding of that data, so that's gonna span years, potentially. The result of all these factors multiplied together is that you get trillions of different lineages. Think about how many snippets of data you might bring together to make a presentation, to do some research, or just to share updates; how many people are gonna interact with that and have access to it; and then maybe take that data and do something legitimate with it.
But that's still another path the data is taking. So if you think about it, tracing these billions and trillions of data events manually would be impossible. In fact, even just to store this data, we had to create our own proprietary graph database technology in order to have a performant understanding of it at scale.
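As a toy illustration only (the real system, as noted above, uses a proprietary graph database to handle this at scale), lineage can be modeled as a directed graph where each edge is an action by a user, and you can ask where a piece of data has flowed. All node names and actions below are hypothetical:

```python
from collections import defaultdict


class LineageGraph:
    """Minimal in-memory lineage graph: nodes are data locations,
    edges are labeled with who moved the data and how."""

    def __init__(self):
        self.edges = defaultdict(list)  # source -> [(dest, actor, action)]

    def record(self, source, dest, actor, action):
        self.edges[source].append((dest, actor, action))

    def reachable_from(self, source):
        """Every location this data has flowed to, via depth-first search."""
        seen, stack = set(), [source]
        while stack:
            node = stack.pop()
            for dest, _, _ in self.edges.get(node, []):
                if dest not in seen:
                    seen.add(dest)
                    stack.append(dest)
        return seen


g = LineageGraph()
g.record("crm/salesforce", "doc/plan.docx", "alice", "paste")
g.record("doc/plan.docx", "slack/#deals", "bob", "copy")
g.record("doc/plan.docx", "dropbox.com", "bill", "upload")
print(sorted(g.reachable_from("crm/salesforce")))
```

Even this toy makes the scale problem visible: every copy, paste, and upload adds edges, and the number of distinct paths grows combinatorially with users, actions, and time.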
AI and Data Lineage in Action
Cameron Galbrath : But this is where AI becomes really exciting and where AI helps, because lineage, you could think of it as creating the map, and AI is what interprets that map at scale. So in practical terms, how do these things work and layer together? In our case, we start with data lineage: the deep context about where the data came from, who interacted with it, and how it was modified and fragmented over time. To that, we then add a layer of intelligence, AI, to bring a deep understanding of that data and that context into the analysis. And that means we can not only use the traditional methods, regular expressions and other patterns, to classify that data.
But we can also leverage a deep semantic understanding of the content, so we can see the data and its history and use AI to classify it. And there's another really powerful way we can leverage AI, building on this foundation: we can make this information actionable, in this case with AI agents. In particular, we do this in two big ways.
First, we can apply AI to the data flows, looking at the flows and the lineages to identify risky behavior. And we can do that even where no policy has been created. So we don't necessarily need a rule that says "if this, then that," or "if this happens, then that's a risky activity."
We have a complete understanding of what is normal and abnormal for that organization and how that data has traditionally been handled. Then, when something really stands out, we can identify it. And we can go further with our other AI agent, which we call the Linea AI analyst agent.
And that's really cool, because it will launch an investigation automatically. It does the job much like one of the analysts on your team would, or like you would if you're an analyst: you'd look at all kinds of different data sources, the history of what the user did and what they were doing in context, maybe some screen recordings of the activities, and where that data came from.
You'd bring all this information to bear, piece together a story, put together a report, and make some recommendations about what should happen. And so Linea is able to do that automatically, right away, at superhuman scale, to create these really comprehensive, incredible reports about what happened.
So from seeing the activities through data lineage, to adding layers of understanding with AI classification, to taking action with AI agents, data lineage plus AI is a really powerful combination to secure sensitive data.
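The policy-free detection idea described earlier, flagging behavior that deviates from a user's own baseline rather than matching a fixed "if this, then that" rule, can be sketched in miniature. The numbers and the threshold below are made up purely for illustration:

```python
from statistics import mean, stdev

# Hypothetical daily upload counts for one user over two weeks.
history = [3, 2, 4, 3, 5, 2, 3, 4, 3, 2, 4, 3, 2, 3]
today = 42  # a sudden spike in activity

# Flag behavior that deviates sharply from the user's own baseline,
# with no explicit policy rule needed: just a z-score against history.
mu, sigma = mean(history), stdev(history)
z = (today - mu) / sigma
flagged = z > 3
print(f"z-score: {z:.1f}, flagged: {flagged}")
```

Real behavioral models are far richer than a single z-score, but the principle is the same: the baseline comes from observed behavior, not from a hand-written rule.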
Cyber Haven's AI and Data Security Platform
Cameron Galbrath : And so, from a product standpoint, to go a little bit further: our mission really is to help our customers protect their data wherever it lives and goes.
And we've incorporated this combination of data lineage and AI in a few different ways. I've talked a bit about the deep understanding of the data that we bring to bear with data lineage and AI categorization and classification, and then we can layer in other attributes like identity and access.
So we have that deep understanding, but then we also go broad, because not only do we understand what is sensitive for your organization, we can actually give you holistic visibility and control of that data. That means understanding your data at rest, in motion, and in use, whether it's in the cloud, on-prem, or a hybrid arrangement, and on endpoints, wherever it is and wherever it goes, including all the major exfiltration vectors. So whether that's uploading to a cloud share, uploading to a generative AI tool, or pasting into one of those, we're able to see that and protect it.
Now, all of that is great, but one of the things we really understood from working with hundreds of major enterprises is that operationalizing a data security program can actually be pretty challenging. So when we built our solution from the ground up, we made sure it was really easy to operate, meaning it has non-disruptive endpoint agents, lightweight connectors, simple administration, and that layer of agentic AI to bring automation and intelligence to detecting, prioritizing, and investigating incidents. And so we've brought this together in what we call the Cyberhaven AI and Data Security Platform. It builds on those principles and connects all of your major data sources, where that data lives and where it goes.
And then we've created what we believe are market-leading solutions across the entire lifecycle of data. So whether you're discovering and assessing your data and trying to minimize exposure, as you would with a data security posture management solution, or you're trying to embrace AI but put some guardrails around it and understand shadow AI in your organization, we help you secure AI usage.
Of course, we talked a bit about how we understand that people are com a key component of this ecosystem of data, and so we have some tools for insider risk management to identify those risky data flows and potential at-risk employees that are handling data in a potentially challenging way to surface that up to administrators.
And then, finally, to put controls in place at the end with data loss prevention, to make sure you can identify your sensitive data, see how it's moving through your organization, and, if it is something you need to stop from going out, make sure you can stop it from leaving and resulting in fines, market impacts, or any of the downsides that drive our programs.
So, to summarize, we bring this together by going beyond just content inspection, which is only gonna see snapshots of that data, that point-in-time visibility into where data is. We do that through lineage, because data lineage reveals the context behind the data.
Then we go a bit further with AI classification, which effectively makes it smarter. We enrich that understanding of what the content is, the context of where it's been, and what it really means, again, in the context of your organization and what is or isn't sensitive to you.
This combination explains exactly what happened, and with that deep understanding, we can operationalize it through our unified AI and Data Security Platform.
Conclusion
Cameron Galbrath : So thank you so much for joining us today to go over insider risk and how we can help you protect your data in this new era of AI, fragmented data, and derivative data.
As more and more pathways for data open up, there are more and more opportunities for risk, and we're really excited to work with you and our customers to help operationalize these programs and help you protect the data that's most sensitive to you. So thank you very much, and if you have any questions or comments, make sure to drop those into the chat and we'll get those answered for you right away.
Thank you.





