Yahoo Employee Stole 570,000 Pages of Source Code the Day He Quit to Join a Competitor

Alex Lee

August 27, 2022

Updated: March 21, 2025

According to a recent lawsuit, Yahoo’s alleged insider downloaded valuable source code, just as he was set to join a direct competitor.

What happened with the recent Yahoo data breach?

According to a report from The Drum, a recent lawsuit from Yahoo alleges that a former employee, Qian Sang, exfiltrated intellectual property to personal devices after receiving a job offer from The Trade Desk, a direct competitor to Yahoo’s advertising business unit.

Sang was one of the most senior leaders on the research team for Yahoo’s AdLearn. Specifically, he headed its budget spend pacing control system, the core engine that continuously adjusts bid pricing and frequency in the demand-side platform (DSP). He received a job offer from The Trade Desk for a Staff Data Scientist position, which included a raise on his base salary, a six-figure signing bonus, and a stock plan totaling almost $1 million vesting over several years.

Yahoo claimed that roughly 45 minutes after receiving his offer letter from The Trade Desk, Sang downloaded source code to his Yahoo company laptop and transferred it to two personal external storage devices without permission. Furthermore, Sang kept the devices until Yahoo filed a cease-and-desist order a few weeks later.

Upon receiving Sang’s physical devices, Yahoo ran a forensic analysis, which revealed the full measure of his downloaded content. This included:

570,000 pages of source code, including budget spend pacing control algorithms and the source code of AdLearn, the engine behind Yahoo’s DSP (a digital marketplace for real-time ad buying)
Files titled “Bidding Research”
Strategy documents (including competitive analysis of The Trade Desk)

In addition, the analysis found that Sang had previously communicated via WeChat with an unidentified person about using a Western Digital cloud system to back up files.

Where traditional data security tools would have failed to stop Sang’s exfiltration:

Enterprise DLP tools struggle to accurately classify any type of computer code, or distinguish between the company’s proprietary source code and open-source code. These limitations make it extremely difficult to automate the classification and protection of source code.

Data classification tools can tag files like Microsoft Office documents and PDFs, but not data that lives outside of files, such as code stored in GitHub. As the number of cloud apps that employees use keeps growing, it is becoming increasingly difficult to rely on manual classification technologies to protect information not contained in a file.

Network-based DLP and CASB/SSE tools cannot see end-to-end encrypted messages sent through WeChat. Messaging apps like WeChat increasingly use encryption by default, and applications that use certificate pinning will prevent traditional security tools from inspecting the content.

Traditional DLP tools cannot inspect for specific text (e.g. “tensorflow”) within a compressed ZIP file. Given that the source code Sang downloaded came from GitHub, it’s possible it was transferred as a ZIP file. Malicious insiders may also intentionally obscure content by encrypting the data, hiding it within archives, changing file types, and using additional evasion techniques.
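To make the ZIP limitation concrete, here is a minimal Python sketch of what inspecting archive contents for a keyword involves. It uses only the standard library `zipfile` module; the function names and the demo archive are hypothetical and purely illustrative, not how any particular DLP product works. Note that it does not descend into nested archives, and password-protected members can only be reported as uninspectable.

```python
import io
import zipfile


def find_keyword_in_zip(zip_bytes: bytes, keyword: str) -> tuple[list[str], list[str]]:
    """Return (members containing `keyword`, members that could not be inspected)."""
    needle = keyword.lower().encode("utf-8")
    hits: list[str] = []
    uninspectable: list[str] = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            try:
                data = archive.read(info)  # decompresses the member in memory
            except RuntimeError:
                # zipfile raises RuntimeError for encrypted members read without a password.
                uninspectable.append(info.filename)
                continue
            if needle in data.lower():
                hits.append(info.filename)
    return hits, uninspectable


def build_demo_zip() -> bytes:
    """Build a small in-memory archive to exercise the scanner (made-up contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("model/train.py", "import tensorflow as tf\n")
        archive.writestr("notes/vacation.txt", "beach photos\n")
    return buffer.getvalue()


if __name__ == "__main__":
    print(find_keyword_in_zip(build_demo_zip(), "tensorflow"))
```

A scanner that only looks at the raw compressed bytes would miss the match in this demo, because the keyword appears only after decompression.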

Insider risk and UEBA tools can flag an unusual volume of data downloaded or uploaded, but not the business value of that data (source code vs. vacation photos). Such tools may have detected Sang’s exfiltration behavior, but the alert would have been buried in an ocean of false positives that SOC teams struggle to investigate.
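The volume-only limitation can be illustrated with a small Python sketch. It assumes a simple z-score test against a user's own transfer history; the function name, the threshold of 3.0, and the sample figures are all assumptions made for illustration, and real UEBA products use more elaborate models.

```python
from statistics import mean, stdev


def is_volume_anomaly(history_mb: list[float], today_mb: float, threshold: float = 3.0) -> bool:
    """Flag today's transfer volume if it sits far above the user's own baseline."""
    if len(history_mb) < 2:
        return False  # not enough history to estimate a baseline
    baseline = mean(history_mb)
    spread = stdev(history_mb)
    if spread == 0:
        return today_mb > baseline
    return (today_mb - baseline) / spread > threshold


if __name__ == "__main__":
    usual_days = [100.0, 120.0, 90.0, 110.0, 105.0]  # made-up daily volumes in MB
    print(is_volume_anomaly(usual_days, 5000.0))  # a 5 GB day stands out
    print(is_volume_anomaly(usual_days, 115.0))   # a normal day does not
```

The function takes only a number of megabytes as input, so 5 GB of proprietary source code and 5 GB of vacation photos produce exactly the same alert. That is the gap the paragraph above describes.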

Post-incident forensics tools can be used to investigate what actions a user performed, but they cannot stop data exfiltration in progress. The forensics tools Yahoo used also required physical custody of the devices, which delayed detection and increased business risk.

How Cyberhaven DDR could have detected and stopped Yahoo’s data breach sooner:

First, Cyberhaven’s Data Detection and Response (DDR) observability capabilities could have allowed Yahoo to track and trace every piece of data, whether contained in a file or copied and pasted directly from an application (e.g. GitHub). Because Cyberhaven tracks data from source to destination, a rogue employee’s activity could have been identified much earlier. Furthermore, this data tracing can be conducted without access to the employee’s physical devices. Yahoo’s forensic analysis did require that access, which delayed its risk management response while it waited for Sang’s devices to arrive.

Second, with Cyberhaven, Yahoo could have set specific policies that trigger automated alerts on unusual attempts to download source code from web applications (like GitHub) or sensitive strategy documents from Slack, before any exfiltration takes place. Policies can be based on various criteria, including download source (e.g. a website), destination (e.g. a USB drive), action, text classification, and user behavior in its business context (e.g. why is a research team member downloading troves of documents owned by the corporate development team?).

Even at the point of egress, Cyberhaven could have interrupted Sang with a warning pop-up message before the exfiltration, giving him the chance to stop in case the action was unintentional. If he continued, Cyberhaven could have automatically blocked the transfer of data to a personal USB device outright.
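The policy criteria and the warn-then-block flow described above can be sketched in Python as follows. This is a hypothetical model for illustration only: the `DataEvent` and `Policy` classes, their field names, and the `decide` function are assumptions and do not represent Cyberhaven’s actual policy engine or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataEvent:
    """One attempted data movement (hypothetical schema)."""
    user_team: str
    source: str          # where the data originated, e.g. "github.com"
    destination: str     # where it is going, e.g. "usb"
    content_label: str   # e.g. "source_code"
    already_warned: bool = False  # True if the user saw a warning and continued


@dataclass(frozen=True)
class Policy:
    """A rule; any criterion left as None matches everything."""
    name: str
    user_team: Optional[str] = None
    source: Optional[str] = None
    destination: Optional[str] = None
    content_label: Optional[str] = None

    def matches(self, event: DataEvent) -> bool:
        pairs = [
            (self.user_team, event.user_team),
            (self.source, event.source),
            (self.destination, event.destination),
            (self.content_label, event.content_label),
        ]
        return all(want is None or want == got for want, got in pairs)


def decide(event: DataEvent, policies: list[Policy]) -> str:
    """Return "allow", "warn" (first attempt), or "block" (continued after warning)."""
    for policy in policies:
        if policy.matches(event):
            return "block" if event.already_warned else "warn"
    return "allow"


if __name__ == "__main__":
    rules = [Policy(name="code-to-usb", source="github.com",
                    destination="usb", content_label="source_code")]
    attempt = DataEvent("research", "github.com", "usb", "source_code")
    print(decide(attempt, rules))
```

Keying the rule on where the data came from and where it is going, rather than on file contents alone, is what lets the same rule apply whether the code travels as plain files or inside an archive.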

How can Cyberhaven DDR identify and prevent other types of breaches?

Extensive Observability via Data Tracing

Cyberhaven tracks and correlates events across the entire life and workflow of the data as it passes between users, machines, and applications. In addition, Cyberhaven can show the flow of sensitive data through multiple channels, including SaaS apps (like Zoom and Slack), network shares, endpoints, and email, from creation through egress. Data lineage can be reconstructed retroactively for existing data and monitored going forward.
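As a rough illustration of the data lineage idea, the Python sketch below records each movement of data as an edge in a graph and walks backward from any copy to its origin. The `LineageGraph` class and the location labels are hypothetical; this is a simplified model of the concept, not Cyberhaven’s implementation.

```python
from collections import defaultdict


class LineageGraph:
    """Records data movements as edges and traces any copy back to its origin."""

    def __init__(self) -> None:
        # Maps each item to the set of items it was derived from.
        self._parents: dict[str, set[str]] = defaultdict(set)

    def record(self, source: str, destination: str) -> None:
        """Note that `destination` was created from `source` (copy, download, zip...)."""
        self._parents[destination].add(source)

    def origins(self, item: str) -> set[str]:
        """Walk backward through recorded events to the items with no known parent."""
        roots: set[str] = set()
        seen: set[str] = set()
        stack = [item]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            parents = self._parents.get(node)
            if parents:
                stack.extend(parents)
            else:
                roots.add(node)
        return roots


if __name__ == "__main__":
    graph = LineageGraph()
    graph.record("github:adlearn-repo", "laptop:adlearn.zip")
    graph.record("laptop:adlearn.zip", "usb:backup.zip")
    print(graph.origins("usb:backup.zip"))
```

Because the origin is recovered from the recorded events rather than from the file contents, the trace still works after the data has been compressed or renamed along the way.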

Host & Application Context-Based Data Filters

Cyberhaven analyzes virtually every action performed on a piece of data, such as editing a document, copying data from one file to another, sharing data across an application, saving it to a USB drive, or encrypting or compressing it. This ensures that all the important actions are seen even in cases where the content itself is no longer visible.

Real-Time Enforcement

Cyberhaven can identify threats, detect unusual behavior, and give staff the opportunity to enforce policy in real time, before a violation occurs. Options here can include in-line blocking, as well as more user-friendly, less disruptive options such as warning employees or redirecting them to a safer, approved path.

Data Theft Risk is Ever-Increasing

Data has increasingly become the most valuable resource of modern organizations. Cyberhaven’s DDR is designed for modern enterprises’ data protection needs and goes beyond the capabilities of traditional DLP and data security tools. Complete visibility through data lineage and tracing is critical, and that is how Cyberhaven can prevent data theft from insider threats like the one Yahoo suffered.