11/21/2025

How AI Companies Can Use Data Lineage To Stop IP Theft — And Win When It Goes To Court

Dmytro Zherebko
Guest Contributor, Senior Manager, Software Engineering

Mick Jeon
Guest Contributor, Machine Learning (ML) Engineer

What you will learn

  • How organizations lose AI IP through developer workflows
  • Why “standard” controls and legacy DLP miss model weights, datasets, vectors, and prompts
  • Practical controls that work across notebooks, registries, endpoints, and SaaS
  • Why lineage‑first data security is a superior approach
  • How lineage‑based data security provides a defensible chain‑of‑custody if litigation becomes necessary

Real-world exfiltration – focus on flows, then files

The AI boom is the 21st-century gold rush, and it is producing a wave of emerging AI companies. Being first to build and apply AI in novel ways can be the difference between success and failure. Because of this, companies can find themselves trading security for time to market.

AI development teams move fast, and their workflows are fluid: artifacts are created in notebooks, pulled from registries, written to endpoints, synced to SaaS, and shared via links — a chain of actions traditional controls rarely see end‑to‑end.

Think of exfiltration as a sequence, not a single object.

Intentional theft

  1. An AI research scientist downloads a multi-gigabyte model checkpoint (.pt) or RLHF dataset (.parquet) from cloud storage to their local machine
  2. They zip the file
  3. They change the file extension to .txt
  4. They upload it to their personal Google Drive

Negligence

  1. A developer runs an evaluation and produces a checkpoint-like model_v3.safetensors in the notebook runtime (creates an artifact).
  2. They click Download, and the artifact moves from the notebook to the local machine (endpoint write).
  3. A consumer sync client auto‑uploads ~/Downloads to a personal Drive account (SaaS upload outside enterprise control).
  4. They share a drive link in a private Slack DM with a contractor, who pulls the file (external access).

Companies that rely on standard scanners asking only “what’s in the registry now?” miss the movement and the context that can show intent or negligence.

To stop theft — and prove it later — you need to capture the sequence and bind it to the artifact’s immutable identity across every surface.
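
To make the identity idea concrete, here is a minimal Python sketch (file names and contents are invented for the demo) of why a content hash is the right anchor: renaming never changes it, while zipping produces new bytes, so lineage has to record the compression event that links the archive back to the original hash.

```python
import hashlib
import zipfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file so multi-gigabyte checkpoints don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-in artifact so the demo runs; in reality this is a multi-GB checkpoint.
checkpoint = Path("model_v3.safetensors")
checkpoint.write_bytes(b"fake-tensor-bytes" * 1000)
original = sha256_of(checkpoint)

# Step 2 of the theft sequence: zipping produces new bytes, so the archive's
# hash differs -- lineage must record the compression event to keep the link.
with zipfile.ZipFile("archive.zip", "w") as z:
    z.write(checkpoint)
archive_hash = sha256_of(Path("archive.zip"))
assert archive_hash != original

# Step 3: renaming to a harmless extension changes nothing about the content
# hash, so the disguise is useless once the zip event is already recorded.
disguised = Path("archive.zip").rename("notes.txt")
assert sha256_of(disguised) == archive_hash
```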

How legacy data loss prevention (DLP) and insider risk management (IRM) can fall short

Legacy DLP wasn’t built with AI in mind.

Traditional DLP was designed for documents, not model weights, datasets, vectors, and prompts. As a result, these solutions fail in several ways.

  • Model weights/checkpoints are opaque binaries (.pt, .safetensors, .ckpt) with no searchable text; many scanners skip them or return noise (illustrated in the sketch after this list).
  • Large file sizes (e.g., tens of GB) make inline inspection slow, expensive, or infeasible; teams have observed scanners that can’t process very large models at all.
  • Tags/labels are brittle and don’t survive copy/zip/rename, so provenance is lost along the way.
  • DLP only detects the egress (step 4 in the workflows above), missing everything that led up to it.
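
To illustrate the first failure mode, here is a toy sketch, assuming a keyword-based scanner and a stand-in weights file: the strings a document-era DLP searches for simply never occur in serialized tensors, while the content hash that lineage relies on is always computable.

```python
import hashlib
from pathlib import Path

KEYWORDS = (b"confidential", b"proprietary", b"internal only")

def keyword_scan(path: Path) -> list[bytes]:
    """Roughly what a document-era scanner does: search for known strings.
    (Toy code: reading a 10 GB model into memory at once is itself the problem.)"""
    data = path.read_bytes()
    return [k for k in KEYWORDS if k in data]

# Stand-in for an opaque tensor file.
weights = Path("model_v3.safetensors")
weights.write_bytes(bytes(range(256)) * 64)

print(keyword_scan(weights))    # [] -- serialized tensors contain no keywords
print(hashlib.sha256(weights.read_bytes()).hexdigest())  # identity still computable
```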

The result is alerts, not answers: you might see “something uploaded,” but you can’t tell if it was an authorized export or theft, nor can you reconstruct origin → movement → egress.

| Feature | Legacy DLP approach | Why it fails for AI |
| --- | --- | --- |
| File inspection | Scans document text for keywords. | Opaque binaries: model weights (.pt, .safetensors) have no searchable text. |
| Performance | Inline inspection of standard files. | File size: models are often 10 GB+, causing scanners to time out or fail. |
| Tracking | Uses brittle tags or labels. | Fragility: provenance is lost immediately if the file is zipped or renamed. |
| Scope | Focuses on the final egress point. | Blind spots: misses the chain of custody (creation → movement) needed for court. |

Insider risk management: poor controls = access + opportunity

As highlighted earlier, insider risk is a critical pathway for AI IP loss and theft: employees, and often contractors and partners, are provisioned access to crown‑jewel artifacts and can move them through everyday workflows that bypass perimeter controls.

The vast majority of day-to-day events are innocent, albeit negligent (e.g., convenience copies, shadow sync); some are malicious (take‑to‑leave). In both cases, the impact and loss of IP can be the same.

Many existing IRM tools surface these actions as noise: false alerts that get ignored or closed without further investigation. To be effective, IRM solutions must be able to determine provenance (data origin and history) and intent through contextual analysis of user actions.

What separates winners in investigations (and in court)

Organizations that treat artifacts as first‑class objects can reconstruct a story in minutes and hand counsel a defensible chain of custody.

  • Ensure broad visibility into all datasets and model artifacts scattered across your organization’s cloud footprint.
  • Capture the artifact at creation with metadata (run_id, dataset_id, owner) and a cryptographic hash (e.g., sha256); a minimal sketch follows this list.
  • Record events from cloud to endpoint and correlate them to the artifact identity: registry publishes/downloads, CI outputs, notebook export intents, endpoint writes, presigned‑URL issuance, and SaaS uploads.
  • Sign training‑run attestations and store audit logs immutably.
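
As a sketch of the capture step (the function, field names, and values are illustrative, not any particular product’s API), a manifest like the one below binds the hash to run metadata the moment the artifact exists:

```python
import hashlib
import json
import time
from pathlib import Path

def capture_artifact(path: Path, run_id: str, dataset_id: str, owner: str) -> dict:
    """Bind an artifact's immutable identity to its provenance at creation."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return {
        "sha256": h.hexdigest(),    # immutable identity
        "run_id": run_id,           # which training run produced it
        "dataset_id": dataset_id,   # which data it was derived from
        "owner": owner,
        "created_at": time.time(),
        "path": str(path),
    }

artifact = Path("model_v3.safetensors")
artifact.write_bytes(b"fake-weights" * 100)   # stand-in so the sketch runs
manifest = capture_artifact(artifact, run_id="run-0421",
                            dataset_id="rlhf-v7", owner="ml-platform")
print(json.dumps(manifest, indent=2))
```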

Operationalizing these steps converts raw telemetry into legal‑grade evidence and turns “an alert” into “admissible, time‑lined proof of origin, ownership, and egress.”

A practical and prioritized approach to ensure success

The goal is simple: capture flows and bind them to immutable artifact identities so you can enforce with low noise and prove what happened if needed.

Short term

  • Require managed devices (or cloud workspaces) for sensitive models and fine-tuning datasets.
  • Use SaaS DLP to flag large binary uploads and presigned‑link posts in Slack/email as detection points.

Medium term

  • Notebook/IDE guardrails: route exports to monitored registries; block direct downloads of .pt/.safetensors/.ckpt/.parquet files.
  • Endpoint DLP with process attribution and immediate hashing on write; watch Downloads and common upload processes (see the sketch after this list).
  • Presigned‑URL proxy + CI gating: centralize issuance with lineage checks and short TTLs; prevent publication of containers/artifacts with weights unless attested.
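
Here is a minimal sketch of the hashing-on-write idea from the endpoint bullet, using a naive polling loop; a production agent would hook file-system events and attribute the writing process, and the watched folder and extension list are assumptions:

```python
import hashlib
import time
from pathlib import Path

WATCHED = Path.home() / "Downloads"                    # assumed watch folder
RISKY = {".pt", ".safetensors", ".ckpt", ".parquet"}   # extensions from the bullet above

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

seen = set(WATCHED.iterdir())
while True:                     # naive polling; a real agent hooks FS events
    for path in WATCHED.iterdir():
        if path not in seen and path.suffix in RISKY:
            # Hash on write: identity is captured before any zip/rename/upload.
            print(f"ALERT {path.name} sha256={sha256_of(path)}")
        seen.add(path)
    time.sleep(5)
```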

Long-term (definitive)

  • Artifact‑first lineage: require registries to record run_id, dataset_id, owner, and sha256; sign run attestations; store audit logs immutably (see the sketch after this list).
  • Cross‑surface correlation into a single incident timeline; export legal‑ready forensic bundles (hashes, attestations, access logs, policy decisions).
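
A compact sketch of the two long-term primitives, with an HMAC key standing in for whatever signing infrastructure (KMS/HSM) is actually used and a hash-chained list standing in for an immutable audit store:

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"replace-with-a-managed-secret"   # in practice: a KMS/HSM-held key

def sign_attestation(manifest: dict) -> dict:
    """Attach a signature so a training-run record can't be altered silently."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"manifest": manifest, "signature": sig}

class HashChainedLog:
    """Append-only log in which every entry commits to the previous one,
    so after-the-fact tampering with history is detectable."""
    def __init__(self):
        self.entries = []
        self.prev = "0" * 64
    def append(self, event: dict) -> None:
        payload = json.dumps({"prev": self.prev, "event": event}, sort_keys=True).encode()
        self.prev = hashlib.sha256(payload).hexdigest()
        self.entries.append({"event": event, "hash": self.prev})

log = HashChainedLog()
log.append(sign_attestation({"run_id": "run-0421", "sha256": "ab12..."}))
print(log.entries[-1]["hash"])
```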

How Cyberhaven ties it together

Cyberhaven models each artifact, records provenance, correlates events across registries/notebooks/endpoints/SaaS, and enforces policy by origin and role — the difference between having an alert and having a defensible case.

  • Artifact lineage and signed attestations at creation (immutable hashes + run metadata).
  • Cross‑surface telemetry (registry hooks, notebook guards, endpoint capture, SaaS connectors) correlated into a single incident.
  • Presigned‑URL gating with lineage checks and short TTLs to shrink blast radius and keep evidence intact.

A protected team can point to the same sha256, data source, and time‑stamped access logs across registries, endpoints, and Slack — a chain‑of‑custody counsel can rely on.
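
The correlation itself can be as simple as grouping telemetry by artifact identity. The events below are invented for illustration and assume the lineage system has already resolved derived artifacts (like the zip) back to the original hash:

```python
from collections import defaultdict

# Invented telemetry from three surfaces, all carrying the artifact's sha256.
events = [
    {"sha256": "ab12...", "ts": "2025-11-20T09:14Z", "surface": "registry", "action": "download"},
    {"sha256": "ab12...", "ts": "2025-11-20T09:21Z", "surface": "endpoint", "action": "zip+rename"},
    {"sha256": "ab12...", "ts": "2025-11-20T09:30Z", "surface": "saas", "action": "drive_upload"},
]

# Group by identity and sort by time: one incident timeline per artifact.
timeline = defaultdict(list)
for e in sorted(events, key=lambda e: e["ts"]):
    timeline[e["sha256"]].append(f'{e["ts"]} {e["surface"]}: {e["action"]}')

for artifact, steps in timeline.items():
    print(artifact)
    for step in steps:
        print(" ", step)
```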

Would Cyberhaven stop the example exfiltration flow?

Let’s revisit the exfiltration sequence:

  1. Download sensitive IP (intellectual property) to a local machine
  2. Zip the file
  3. Change the extension to obfuscate it
  4. Upload it to a personal Google Drive

How Cyberhaven could have prevented this sequence

  • Creates a connection (data lineage) between cloud and endpoint, combining visibility from DSPM and the endpoint agent
  • Extends the file’s data lineage so it survives compression and renaming
  • Classifies the file by its internal provenance using AI data categorization

The mindset shift

Protecting AI IP isn’t a single scanner; it’s governance over the flow. Treat artifacts as living assets, follow their journey across tools, and make that journey auditable — so you can stop exfiltration in motion and, if it happens, prove it in court.