11/21/2025

How AI Companies Can Use Data Lineage To Stop IP Theft — And Win When It Goes To Court

Dmytro Zherebko
Guest Contributor, Senior Manager, Software Engineering

Mick Jeon
Guest Contributor, Machine Learning (ML) Engineer

What you will learn

  • How organizations lose AI IP through developer workflows
  • Why “standard” controls and legacy DLP miss model weights, datasets, vectors, and prompts
  • Practical controls that work across notebooks, registries, endpoints, and SaaS
  • Why lineage‑first data security is a superior approach
  • How lineage‑based data security provides a defensible chain‑of‑custody if litigation becomes necessary

Real-world exfiltration – focus on flows, then files

The AI boom is the 21st-century gold rush, and it is producing a wave of emerging AI companies. Being first to build and apply AI in novel ways can be the difference between success and failure. Because of this, companies can find themselves trading security for time to market.

AI development teams move fast, and their workflows are fluid: artifacts are created in notebooks, pulled from registries, written to endpoints, synced to SaaS, and shared via links — a chain of actions traditional controls rarely see end‑to‑end.

Think of exfiltration as a sequence, not a single object.

Intentional theft

  1. An AI research scientist downloads a multi-gigabyte model checkpoint (.pt) or RLHF dataset (.parquet) from cloud storage to their local machine
  2. They zip the file
  3. They change the file extension to .txt
  4. They upload it to their personal Google Drive

Negligence

  1. A developer runs an evaluation and produces a checkpoint-like model_v3.safetensors in the notebook runtime (creates an artifact).
  2. They click Download, and the artifact moves from the notebook to the local machine (endpoint write).
  3. A consumer sync client auto‑uploads ~/Downloads to a personal Drive account (SaaS upload outside enterprise control).
  4. They share a drive link in a private Slack DM with a contractor, who pulls the file (external access).

Companies that rely on standard scanners asking only “what’s in the registry now?” miss the movement and the context that can show intent or negligence.

To stop theft — and prove it later — you need to capture the sequence and bind it to the artifact’s immutable identity across every surface.
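
To make the identity idea concrete, here is a minimal Python sketch (file names and contents are invented for the demo) of why a content hash is the right anchor: renaming never changes it, while zipping produces new bytes, so lineage has to record the compression event that links the archive back to the original hash.

```python
import hashlib
import zipfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file so multi-gigabyte checkpoints don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-in artifact so the demo runs; in reality this is a multi-GB checkpoint.
checkpoint = Path("model_v3.safetensors")
checkpoint.write_bytes(b"fake-tensor-bytes" * 1000)
original = sha256_of(checkpoint)

# Step 2 of the theft sequence: zipping produces new bytes, so the archive's
# hash differs -- lineage must record the compression event to keep the link.
with zipfile.ZipFile("archive.zip", "w") as z:
    z.write(checkpoint)
archive_hash = sha256_of(Path("archive.zip"))
assert archive_hash != original

# Step 3: renaming to a harmless extension changes nothing about the content
# hash, so the disguise is useless once the zip event is already recorded.
disguised = Path("archive.zip").rename("notes.txt")
assert sha256_of(disguised) == archive_hash
```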

How legacy data loss prevention (DLP) and insider risk management (IRM) can fall short

Legacy DLP wasn’t built with AI in mind.

Traditional DLP was designed for documents, not model weights, datasets, vectors, and prompts. As a result, these solutions fail in several ways.

  • Model weights/checkpoints are opaque binaries (.pt, .safetensors, .ckpt) with no searchable text; many scanners skip them or return noise (illustrated in the sketch after this list).
  • Large file sizes (e.g., tens of GB) make inline inspection slow, expensive, or infeasible; teams have observed scanners that can’t process very large models at all.
  • Tags/labels are brittle and don’t survive copy/zip/rename, so provenance is lost along the way.
  • DLP only detects the egress (step 4 in the workflows above), missing everything that led up to it.
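
To illustrate the first failure mode, here is a toy sketch, assuming a keyword-based scanner and a stand-in weights file: the strings a document-era DLP searches for simply never occur in serialized tensors, while the content hash that lineage relies on is always computable.

```python
import hashlib
from pathlib import Path

KEYWORDS = (b"confidential", b"proprietary", b"internal only")

def keyword_scan(path: Path) -> list[bytes]:
    """Roughly what a document-era scanner does: search for known strings.
    (Toy code: reading a 10 GB model into memory at once is itself the problem.)"""
    data = path.read_bytes()
    return [k for k in KEYWORDS if k in data]

# Stand-in for an opaque tensor file.
weights = Path("model_v3.safetensors")
weights.write_bytes(bytes(range(256)) * 64)

print(keyword_scan(weights))    # [] -- serialized tensors contain no keywords
print(hashlib.sha256(weights.read_bytes()).hexdigest())  # identity still computable
```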

The result is alerts, not answers: you might see “something uploaded,” but you can’t tell if it was an authorized export or theft, nor can you reconstruct origin → movement → egress.

| Feature | Legacy DLP approach | Why it fails for AI |
| --- | --- | --- |
| File inspection | Scans document text for keywords. | Opaque binaries: model weights (.pt, .safetensors) have no searchable text. |
| Performance | Inline inspection of standard files. | File size: models are often 10 GB+, causing scanners to time out or fail. |
| Tracking | Uses brittle tags or labels. | Fragility: provenance is lost immediately if the file is zipped or renamed. |
| Scope | Focuses on the final egress point. | Blind spots: misses the chain of custody (creation → movement) needed for court. |

Insider risk management: poor controls = access + opportunity

As highlighted earlier, insider risk is a critical pathway for AI IP loss and theft: employees, and often contractors and partners, are provisioned access to crown‑jewel artifacts and can move them through everyday workflows that bypass perimeter controls.

The vast majority of day-to-day events are innocent, albeit negligent (e.g., convenience copies, shadow sync); some are malicious (take‑to‑leave). In both cases, the impact and loss of IP can be the same.

Many existing IRM tools surface these actions as noise: false alerts that get ignored or closed without further investigation. To be effective, IRM solutions must be able to determine provenance (data origin and history) and intent through contextual analysis of user actions.

What separates winners in investigations (and in court)

Organizations that treat artifacts as first‑class objects can reconstruct a story in minutes and hand counsel a defensible chain of custody.

  • Ensure broad visibility into all datasets and model artifacts scattered across your organization’s cloud footprint.
  • Capture the artifact at creation with metadata (run_id, dataset_id, owner) and a cryptographic hash (e.g., sha256); a minimal sketch follows this list.
  • Record events from cloud to endpoint and correlate them to the artifact identity: registry publishes/downloads, CI outputs, notebook export intents, endpoint writes, presigned‑URL issuance, and SaaS uploads.
  • Sign training‑run attestations and store audit logs immutably.
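
As a sketch of the capture step (the function, field names, and values are illustrative, not any particular product’s API), a manifest like the one below binds the hash to run metadata the moment the artifact exists:

```python
import hashlib
import json
import time
from pathlib import Path

def capture_artifact(path: Path, run_id: str, dataset_id: str, owner: str) -> dict:
    """Bind an artifact's immutable identity to its provenance at creation."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return {
        "sha256": h.hexdigest(),    # immutable identity
        "run_id": run_id,           # which training run produced it
        "dataset_id": dataset_id,   # which data it was derived from
        "owner": owner,
        "created_at": time.time(),
        "path": str(path),
    }

artifact = Path("model_v3.safetensors")
artifact.write_bytes(b"fake-weights" * 100)   # stand-in so the sketch runs
manifest = capture_artifact(artifact, run_id="run-0421",
                            dataset_id="rlhf-v7", owner="ml-platform")
print(json.dumps(manifest, indent=2))
```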

Operationalizing these steps converts raw telemetry into legal‑grade evidence and turns “an alert” into “admissible, time‑lined proof of origin, ownership, and egress.”

A practical and prioritized approach to ensure success

The goal is simple: capture flows and bind them to immutable artifact identities so you can enforce with low noise and prove what happened if needed.

Short term

  • Require managed devices (or cloud workspaces) for sensitive models and fine-tuning datasets.
  • Use SaaS DLP to flag large binary uploads and presigned‑link posts in Slack/email as detection points.

Medium term

  • Notebook/IDE guardrails: route exports to monitored registries; block direct downloads of .pt/.safetensors/.ckpt/.parquet files.
  • Endpoint DLP with process attribution and immediate hashing on write; watch Downloads and common upload processes (see the sketch after this list).
  • Presigned‑URL proxy + CI gating: centralize issuance with lineage checks and short TTLs; prevent publication of containers/artifacts with weights unless attested.
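
Here is a minimal sketch of the hashing-on-write idea from the endpoint bullet, using a naive polling loop; a production agent would hook file-system events and attribute the writing process, and the watched folder and extension list are assumptions:

```python
import hashlib
import time
from pathlib import Path

WATCHED = Path.home() / "Downloads"                    # assumed watch folder
RISKY = {".pt", ".safetensors", ".ckpt", ".parquet"}   # extensions from the bullet above

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

seen = set(WATCHED.iterdir())
while True:                     # naive polling; a real agent hooks FS events
    for path in WATCHED.iterdir():
        if path not in seen and path.suffix in RISKY:
            # Hash on write: identity is captured before any zip/rename/upload.
            print(f"ALERT {path.name} sha256={sha256_of(path)}")
        seen.add(path)
    time.sleep(5)
```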

Long-term (definitive)

  • Artifact‑first lineage: require registries to record run_id, dataset_id, owner, and sha256; sign run attestations; store audit logs immutably (see the sketch after this list).
  • Cross‑surface correlation into a single incident timeline; export legal‑ready forensic bundles (hashes, attestations, access logs, policy decisions).
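
A compact sketch of the two long-term primitives, with an HMAC key standing in for whatever signing infrastructure (KMS/HSM) is actually used and a hash-chained list standing in for an immutable audit store:

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"replace-with-a-managed-secret"   # in practice: a KMS/HSM-held key

def sign_attestation(manifest: dict) -> dict:
    """Attach a signature so a training-run record can't be altered silently."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"manifest": manifest, "signature": sig}

class HashChainedLog:
    """Append-only log in which every entry commits to the previous one,
    so after-the-fact tampering with history is detectable."""
    def __init__(self):
        self.entries = []
        self.prev = "0" * 64
    def append(self, event: dict) -> None:
        payload = json.dumps({"prev": self.prev, "event": event}, sort_keys=True).encode()
        self.prev = hashlib.sha256(payload).hexdigest()
        self.entries.append({"event": event, "hash": self.prev})

log = HashChainedLog()
log.append(sign_attestation({"run_id": "run-0421", "sha256": "ab12..."}))
print(log.entries[-1]["hash"])
```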

How Cyberhaven ties it together

Cyberhaven models each artifact, records provenance, correlates events across registries/notebooks/endpoints/SaaS, and enforces policy by origin and role — the difference between having an alert and having a defensible case.

  • Artifact lineage and signed attestations at creation (immutable hashes + run metadata).
  • Cross‑surface telemetry (registry hooks, notebook guards, endpoint capture, SaaS connectors) correlated into a single incident.
  • Presigned‑URL gating with lineage checks and short TTLs to shrink blast radius and keep evidence intact.

A protected team can point to the same sha256, data source, and time‑stamped access logs across registries, endpoints, and Slack — a chain‑of‑custody counsel can rely on.
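
The correlation itself can be as simple as grouping telemetry by artifact identity. The events below are invented for illustration and assume the lineage system has already resolved derived artifacts (like the zip) back to the original hash:

```python
from collections import defaultdict

# Invented telemetry from three surfaces, all carrying the artifact's sha256.
events = [
    {"sha256": "ab12...", "ts": "2025-11-20T09:14Z", "surface": "registry", "action": "download"},
    {"sha256": "ab12...", "ts": "2025-11-20T09:21Z", "surface": "endpoint", "action": "zip+rename"},
    {"sha256": "ab12...", "ts": "2025-11-20T09:30Z", "surface": "saas", "action": "drive_upload"},
]

# Group by identity and sort by time: one incident timeline per artifact.
timeline = defaultdict(list)
for e in sorted(events, key=lambda e: e["ts"]):
    timeline[e["sha256"]].append(f'{e["ts"]} {e["surface"]}: {e["action"]}')

for artifact, steps in timeline.items():
    print(artifact)
    for step in steps:
        print(" ", step)
```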

Would Cyberhaven stop the example exfiltration flow?

Let’s revisit the exfiltration sequence:

  1. Download sensitive IP (intellectual property) to a local machine
  2. Zip the file
  3. Change the extension to obfuscate it
  4. Upload it to a personal Google Drive

How Cyberhaven could have prevented this sequence

  • Creates a connection (data lineage) between cloud and endpoint, combining visibility from DSPM and the endpoint agent
  • Extends the file’s data lineage so it survives compression and renaming
  • Classifies the file by its internal provenance using AI data categorization

The mindset shift

Protecting AI IP isn’t a single scanner; it’s governance over the flow. Treat artifacts as living assets, follow their journey across tools, and make that journey auditable — so you can stop exfiltration in motion and, if it happens, prove it in court.