11% of data employees paste into ChatGPT is confidential
The average company leaks confidential material to ChatGPT hundreds of times per week. OpenAI can use that material as training data, and it may surface in ChatGPT's responses to other users.
Updated June 18, 2023
Since ChatGPT launched on November 30, 2022, it has taken the world by storm. People are using it to create poems, essays for school, and song lyrics. It’s also making inroads in the workplace. According to data from Cyberhaven’s product, as of June 1, 10.8% of employees have used ChatGPT in the workplace and 8.6% have pasted company data into it since it launched.
Some knowledge workers say that using the tool makes them 10 times more productive. But companies like JP Morgan and Verizon are blocking ChatGPT over risks to confidential data. Our analysis shows that 4.7% of employees have pasted confidential data into ChatGPT.
The problem with putting company data into ChatGPT
OpenAI uses the content people put into ChatGPT as training data to improve its technology. This is problematic because employees are copying and pasting all kinds of confidential data into ChatGPT to have the tool rewrite it, from source code to patient medical records. Recently, an attorney at Amazon warned employees not to put confidential data into ChatGPT, noting, “we wouldn’t want [ChatGPT] output to include or resemble our confidential information (and I’ve already seen instances where its output closely matches existing material).”
Consider a few examples:
- A doctor inputs a patient’s name and details of their condition into ChatGPT to have it draft a letter to the patient’s insurance company justifying the need for a medical procedure. In the future, if a third party asks ChatGPT “what medical problem does [patient name] have?” ChatGPT could answer based on what the doctor provided.
- An executive inputs bullet points from the company’s 2023 strategy document into ChatGPT and asks it to rewrite it in the format of a PowerPoint slide deck. In the future, if a third party asks “what are [company name]’s strategic priorities this year,” ChatGPT could answer based on the information the executive provided.
On March 21, 2023, OpenAI temporarily took ChatGPT offline due to a bug that mislabeled chats in users’ history with the titles of chats from other users. To the extent that those titles contained sensitive or confidential information, they could have been exposed to other ChatGPT users.
On April 6, 2023, news broke that Samsung had discovered employees putting confidential data into ChatGPT, including source code they wanted to debug and transcripts of internal meetings they wanted summarized. As an emergency measure, the company limited input to ChatGPT to 1,024 bytes.
Identifying what data goes to ChatGPT isn’t easy
The traditional security products companies rely on to protect their data are blind to employee usage of ChatGPT. Before blocking ChatGPT, JP Morgan reportedly couldn’t determine “how many employees were using the chatbot or for what functions they were using it.” It’s difficult for security products, like legacy data loss prevention platforms, to monitor usage of ChatGPT and protect data going to it for two reasons:
- Copy/paste out of a file or app — When workers input company data into ChatGPT, they don’t upload a file but rather copy and paste content into their web browser. Many security products are designed to protect files (which are tagged confidential) from being uploaded, but once content is copied out of a file, they are unable to keep track of it.
- Confidential data contains no recognizable pattern — Company data going to ChatGPT often doesn’t contain a recognizable pattern that security tools look for, like a credit card number or Social Security number. Without knowing more about its context, security tools today can’t tell the difference between someone inputting the cafeteria menu and the company’s M&A plans.
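To illustrate the second point, here is a minimal sketch of how pattern-based scanning works and where it falls short. The regexes and sample strings are illustrative stand-ins, not taken from any real DLP product:

```python
import re

# Simplified stand-ins for the kinds of patterns a legacy DLP tool scans for
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def flag_sensitive(text: str) -> list[str]:
    """Return the names of patterns that match, as a pattern-based scanner would."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]

# A pasted SSN is caught...
print(flag_sensitive("Patient SSN: 123-45-6789"))  # ['ssn']

# ...but strategy text contains no recognizable pattern and sails through
print(flag_sensitive("Q3 priority: acquire Acme Corp before rivals bid"))  # []
```

The second paste is exactly the kind of high-value confidential data described above, yet without context a pattern matcher has no way to distinguish it from a cafeteria menu.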
Despite some companies blocking ChatGPT, its use in the workplace is growing rapidly
Cyberhaven Labs analyzed ChatGPT usage for 1.6 million workers at companies across industries that use the Cyberhaven product. Since ChatGPT launched publicly, 10.8% of knowledge workers have tried using it at least once in the workplace and 8.6% have pasted data into it. Despite a growing number of companies outright blocking access to ChatGPT, usage continues to increase. On May 31, our product detected a record 7,999 attempts to paste corporate data into ChatGPT per 100,000 employees, defined as “data egress” events in the chart below.
Cyberhaven also tracks data ingress, such as employees copying data out of ChatGPT and pasting it elsewhere like a Google Doc, a company email, or their source code editor. Workers copy data out of ChatGPT nearly twice as often as they paste company data into it. This makes sense because in addition to asking ChatGPT to rewrite existing content, you can simply type a prompt such as “draft a blog post about how problematic ChatGPT is from a data security standpoint” and it will write it from scratch. Full disclosure: this post was written the old-fashioned way by a human being. 🙂
The average company leaks sensitive data to ChatGPT hundreds of times each week
Since ChatGPT launched, 4.7% of employees have pasted sensitive data into the tool at least once. Sensitive data makes up 11% of what employees paste into ChatGPT, but since usage of ChatGPT is so high and growing rapidly, this turns out to be a lot of information. Cyberhaven Labs calculated the number of incidents per 100,000 employees to understand how common they are across companies. You can apply this rate of incidents to the number of employees at any given company to estimate how much data employees are putting into ChatGPT.
Between the week of February 26 and the week of April 9, the number of incidents per 100,000 employees where confidential data went to ChatGPT increased by 60.4%. The most common types of confidential data leaking to ChatGPT are sensitive/internal-only data (319 incidents per week per 100,000 employees), source code (278), and client data (260). During this time period source code eclipsed client data as the second most common type of sensitive data going to ChatGPT.
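Scaling the per-100,000-employee rates above to a company of a given size is simple arithmetic; the sketch below uses the article's weekly rates, while the 10,000-person head count is a made-up example:

```python
def weekly_incidents(rate_per_100k: float, employees: int) -> float:
    """Scale a per-100,000-employee weekly incident rate to a given head count."""
    return rate_per_100k * employees / 100_000

# Weekly rates from the analysis above (incidents per 100,000 employees)
rates = {"sensitive/internal-only": 319, "source code": 278, "client data": 260}

# Hypothetical 10,000-person company
for kind, rate in rates.items():
    print(f"{kind}: ~{weekly_incidents(rate, 10_000):.0f} incidents/week")
```

At that size, the estimate comes to roughly 30 incidents per week in each category, or well over a hundred a month.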
A few bad apples?
At the average company, just 0.9% of employees are responsible for 80% of egress events — incidents of pasting company data into the site. The number is still relatively small, but any one of the egress events we found could be considered an insider threat, responsible for exposing a critical piece of company data. There are many legitimate uses of ChatGPT in the workplace, and companies that find ways to leverage it to improve productivity without risking their sensitive data are poised to benefit.