Introducing Cyberhaven for AI: Security that keeps confidential data out of public AI models

Abhi Puranam

Guest Contributor

Cyberhaven for AI is a new suite of capabilities that gives companies control over what data goes to generative AI tools.

Generative AI products like ChatGPT are taking the world by storm and offer tremendous productivity gains by accelerating mundane and repetitive tasks in the corporate world. Many users are not aware, however, that the freely available versions of these products feed user input back into training the models that power them. This means sensitive corporate data entered directly into ChatGPT can be put at risk, because other ChatGPT users may later be able to access that information.

Today, Cyberhaven is announcing a suite of updates to help organizations tackle the growing risk to corporate secrets, intellectual property, and source code posed by generative AI products like ChatGPT.

Why companies need to limit data to AI tools

Many generative AI programs are built and trained on user input. This isn’t a huge problem if you ask the tool to rewrite your essay for English class, but when an attorney inputs a confidential client document to summarize or an engineer inputs proprietary source code to debug, this information becomes a part of the product’s knowledge database. When other users make their own queries, the AI tool may use that information in its response.

Amazon has already warned its employees against inputting sensitive information after observing ChatGPT outputting data that is very similar to real confidential company data.

By the numbers: sensitive corporate data going to ChatGPT

In the face of this threat, the Cyberhaven Labs team looked closely at how these products, particularly ChatGPT, are being used by our customers’ employees. Across our customer base, we’re observing a tremendous increase in data flowing to and from ChatGPT since its launch in November.

Overall, Cyberhaven classifies 11% of the data flowing to ChatGPT as sensitive. In the last week (February 19 to 25), the average 100,000-person company experienced the following leaks to ChatGPT:

43 leaks of sensitive project files (Example: a land acquisition planning document for a new theme park)
75 leaks of regulated personal data (PII) (Example: a list of customers with their associated home addresses that needs to be reformatted)
70 leaks of regulated health data (PHI) (Example: a letter being drafted by a doctor to a patient’s insurance company with details of their diagnosis)
130 leaks of client data (Example: content from a document sent to a law firm by their client)
119 leaks of source code (Example: code used in a social media app that needs to be edited to change its functionality)
150 leaks of confidential documents (Example: a memo discussing how to handle an upcoming regulatory action by the government)
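
As a rough back-of-the-envelope reading of the figures above, the weekly counts can be summed and scaled to a company of a different size. This is only a sketch under two assumptions that the data above does not confirm: that the six categories do not overlap, and that incidents scale linearly with headcount. The function name and the 5,000-person example are illustrative.

```python
# Weekly incidents per 100,000 employees, copied from the list above.
WEEKLY_LEAKS_PER_100K = {
    "sensitive project files": 43,
    "regulated personal data (PII)": 75,
    "regulated health data (PHI)": 70,
    "client data": 130,
    "source code": 119,
    "confidential documents": 150,
}


def estimate_weekly_leaks(employees: int) -> float:
    """Scale the combined weekly count to a headcount (linear scaling is an assumption)."""
    total_per_100k = sum(WEEKLY_LEAKS_PER_100K.values())
    return total_per_100k * employees / 100_000


total = sum(WEEKLY_LEAKS_PER_100K.values())
print(total)                         # 587 incidents per week per 100,000 employees
print(estimate_weekly_leaks(5_000))  # 29.35 for a hypothetical 5,000-person company
```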

The trend of increasing ChatGPT usage coupled with the significant amount of sensitive data being leaked makes the risk posed by generative AI something information security teams must urgently address.

Cyberhaven for AI: Securely enable AI tools across your workforce while protecting sensitive data

Cyberhaven for AI is a new suite of capabilities that gives companies control over what data goes to generative AI tools. Rather than blocking AI tools altogether, security teams can now enforce fine-grained controls that limit what corporate data goes to consumer products that incorporate a user’s input into a publicly available model. At the same time, they can allow corporate data to go into soon-to-be-released enterprise AI products that do not expose confidential company data beyond the employees who would otherwise be able to see it.

Gain visibility into corporate data flowing to and from AI tools

Cyberhaven automatically logs all data moving to and from any web-based or app-based AI product without the need to define any policies. This means Cyberhaven customers can understand what data has flowed to apps they didn’t even know existed and therefore have not created policies to track or control. Cyberhaven tracks data moving to the products at a granular level, including when users copy and paste content into a chat window like ChatGPT. Armed with these insights, security teams can have a conversation with the business to develop appropriate policies for how to use these applications.

Visualization of data flow to generative AI apps

Accurately classify and protect sensitive data

The way you interact with a generative AI tool makes it difficult for traditional data security approaches to protect important data. Many AI products use a chat interface, and employees copy data out of another window, file, or app on their computer and paste it into their web browser. In other cases, the tools have more defined workflows, but the only way to add content to be summarized, rewritten, or expanded is still to paste it into the browser. This creates several challenges for understanding what type of data is going to the AI tool:

Data doesn’t stay in a file – when you copy data out of a file that has a tag/label and paste it into an AI tool, that classification doesn’t follow the data after it leaves the file.
Sensitive data doesn’t contain a recognizable pattern – many forms of confidential data employees input into AI tools lack an alphanumeric pattern like a credit card number.
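
The second challenge above can be illustrated with a small sketch. It assumes a simplified pattern-based check (a regular expression for 16-digit card numbers) as a stand-in for traditional content inspection; the pattern, function name, and sample strings are illustrative and do not describe any particular product.

```python
import re

# Simplified pattern: four groups of four digits, optionally separated by a space or dash.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def matches_known_pattern(text: str) -> bool:
    """Return True if the text contains something shaped like a card number."""
    return CARD_PATTERN.search(text) is not None


card_text = "Customer card: 4111 1111 1111 1111"
memo_text = "Draft plan: acquire the land parcels before the theme park is announced."

print(matches_known_pattern(card_text))  # True: the digits fit the pattern
print(matches_known_pattern(memo_text))  # False: confidential, but no pattern to match
```

The memo is at least as sensitive as the card number, yet a pattern-only check has nothing to match against.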

Cyberhaven utilizes data lineage to classify sensitive data flowing to AI tools. Lineage-based classification can accurately classify any type of sensitive data because even as it’s copied and pasted into the AI tool, Cyberhaven remembers the context of:

Where the data originated – For example, source code originates from the company GitHub.

How data is handled and stored – For example, text from a document that was saved in the “Corporate Strategy” Google Drive folder.

Who created and modified the data – For example, content created by a drug researcher.
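
The three kinds of context above can be sketched as a small record plus a classification rule. This is a minimal illustration, not Cyberhaven's implementation: the `Lineage` fields, the rule order, and the returned labels are all hypothetical and chosen only to mirror the examples in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lineage:
    """Hypothetical context remembered about a piece of content."""
    origin: str        # where the data originated, e.g. "github"
    location: str      # where it was stored, e.g. a Drive folder name
    creator_role: str  # who created or modified it


def classify(lineage: Lineage) -> str:
    """Return an illustrative label using only lineage, never the content itself."""
    if lineage.origin == "github":
        return "source code"
    if lineage.location == "Corporate Strategy":
        return "confidential document"
    if lineage.creator_role == "drug researcher":
        return "research data"
    return "unclassified"


# Text copied from a document in the "Corporate Strategy" folder keeps its
# classification even after it has left the file.
pasted = Lineage(origin="google drive", location="Corporate Strategy", creator_role="analyst")
print(classify(pasted))  # confidential document
```

Because the label comes from where the content has been rather than what it looks like, the rule still applies after a copy and paste.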

Example data flow to ChatGPT

With an accurate classification of the data, companies can craft policies that prevent sensitive data from flowing to unsafe AI applications, meaning those that add any input to a publicly available machine learning model accessible to anyone, while still allowing less sensitive data through. This lets your workforce safely explore how these tools can help them be more productive.

Enforce granular data security policies across AI tools

Cyberhaven customers now have access to out-of-the-box policies they can customize and deploy to protect their sensitive data from flowing to generative AI tools, rather than banning their usage completely. You can enforce different policy responses based on the type of data, the specific generative AI tool, and whether the instance the user is logged in through is a personal or enterprise account.
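
A policy keyed on those three dimensions might look like the sketch below. Everything here is assumed for illustration: the `Policy` fields, the first-match-wins ordering, the action names, and the default are hypothetical, not the product's actual policy schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    data_type: Optional[str]     # None matches any data type
    tool: Optional[str]          # None matches any AI tool
    account_type: Optional[str]  # "personal", "enterprise", or None for either
    action: str                  # "block", "log", or "allow"


# Hypothetical rules, evaluated top to bottom; the first match wins.
POLICIES = [
    Policy("source code", "ChatGPT", "personal", "block"),
    Policy("source code", None, "enterprise", "allow"),
    Policy(None, "ChatGPT", "personal", "log"),
]


def decide(data_type: str, tool: str, account_type: str, default: str = "allow") -> str:
    """Return the action of the first policy whose fields all match."""
    for p in POLICIES:
        if (p.data_type in (None, data_type)
                and p.tool in (None, tool)
                and p.account_type in (None, account_type)):
            return p.action
    return default


print(decide("source code", "ChatGPT", "personal"))    # block
print(decide("source code", "ChatGPT", "enterprise"))  # allow
print(decide("unclassified", "ChatGPT", "personal"))   # log
```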

Block paste of sensitive data, with optional user override

Cyberhaven can automatically intercept and stop a user from pasting sensitive data into a risky generative AI tool. The user immediately sees a popup with a customizable coaching message explaining why they were stopped and directing them to a safe alternative.

Optionally, you can allow a user to override this blocking response while providing a business justification for their action. This sort of policy can reduce careless behavior by educating the user on potential risks while preserving employee productivity.
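
The block-with-override flow described above can be sketched as a single decision function. The function name, parameters, and return values are hypothetical; the only behavior taken from the text is that an override, when enabled, requires a business justification.

```python
def handle_blocked_paste(allow_override: bool, justification: str = "") -> str:
    """Resolve a paste that a policy has blocked (illustrative logic only)."""
    # An override counts only if the policy permits it and the user gives a reason.
    if allow_override and justification.strip():
        return "allowed with justification"
    return "blocked"


print(handle_blocked_paste(allow_override=False))  # blocked
print(handle_blocked_paste(allow_override=True))   # blocked (no justification given)
print(handle_blocked_paste(True, "Approved vendor review"))  # allowed with justification
```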

Example blocking pop-up from a Cyberhaven policy

Silently alert security to begin an investigation

Rather than block or warn, Cyberhaven can simply log user activity that sends data to a risky application – allowing your team to investigate usage further and then determine a response.

Different policies for personal and corporate accounts

While personal accounts (paid and unpaid) for AI tools like ChatGPT use all inputs to the chat as training data, enterprise-grade generative AI products are in the works and rolling out soon. These enterprise versions will not use inputs as part of publicly available AI models, and enterprises will actually be able to train their own custom models on internal materials like source code, training material, or planning documents. Cyberhaven customers using enterprise versions of generative AI tools can enforce distinct policies for personal and corporate accounts of the same AI tools, ensuring sensitive company data only goes to these internal tools and the output of company data stays within the company.

Securing the AI-powered future of work

The future of an AI-enabled workforce promises to unlock tremendous productivity gains for companies, and Cyberhaven is here to support this growth safely and securely. Blocking these tools altogether is not an option for companies that want to maintain a competitive edge. To learn more about how Cyberhaven can support your AI transformation, schedule a demo with one of our data security experts today.