- Dark data is the information organizations collect and store during regular business activities but never analyze or use, often representing more than half of total enterprise data.
- The security risk is fundamental: you cannot protect data you do not know exists, and attackers will find it before you do.
- Dark data spans structured, unstructured, and semi-structured formats, accumulating in legacy systems, cloud repositories, SaaS tools, and forgotten file shares.
- Compliance obligations under regulations such as GDPR and HIPAA apply to dark data regardless of whether the organization knows it is there.
- Discovering, classifying, and either activating or deleting dark data is a foundational step in any mature data security posture program.
What Is Dark Data?
Dark data is information that organizations collect and store during regular business activities but generally fail to use for analytics, decision-making, or any other purpose. The term comes from Gartner, which notes that dark data "often comprises most organizations' universe of information assets" and is typically retained for compliance reasons while creating more cost and risk than value.
Unlike the data flowing through dashboards and active business systems, dark data is dormant. It accumulates in server log archives, legacy customer relationship management (CRM) databases, email repositories, cloud storage buckets, and any other location where data lands but no one returns to examine it. Research by Splunk found that 60% of organizations report half or more of their data is dark, with one-third saying that figure reaches 75% or higher.
Dark data matters now because the conditions that create it have accelerated. Cloud adoption, SaaS proliferation, and the explosion of machine-generated data have all made it easier to collect and store far more than any organization can analyze. At the same time, data protection regulations have grown stricter, and attackers have grown more deliberate in seeking out unmonitored corners of an enterprise environment. Dark data sits at the intersection of these pressures.
How Dark Data Accumulates
Dark data builds through a set of recurring organizational patterns. When cloud storage costs dropped, organizations adopted a default posture of retaining everything on the assumption that data might prove valuable later. Different departments build isolated repositories with no shared catalog, so data that could benefit another team stays trapped behind departmental walls.
Data without metadata (such as ownership records, sensitivity labels, and creation dates) is effectively undiscoverable: governance tools have nothing to surface and enforcement has nothing to act on. And when infrastructure is modernized, older databases are often retired without migrating or deleting the records they contain, leaving that data outside active monitoring.
Types of Dark Data
Dark data spans all three structural formats, each presenting different discovery and management challenges.
- Structured dark data has a defined schema and lives in organized systems, but is never queried. Examples include transaction records in legacy enterprise resource planning (ERP) systems, archived customer records from discontinued products, historical CRM entries, and financial records retained solely for compliance without any analytical use. Structured dark data is theoretically easy to discover once you know it exists, but it often sits behind permission barriers or in systems that are no longer integrated with modern tooling.
- Unstructured dark data is the largest and fastest-growing category. It includes email correspondence, customer service call recordings, surveillance video footage, scanned documents, slide decks, and PDFs. Unstructured data does not fit neatly into database columns, which makes automated discovery harder.
- Semi-structured dark data falls between the two. It contains some organizational markers but lacks a rigid schema. Examples include JSON and XML files from API interactions, web server logs with mixed field types, Internet of Things (IoT) sensor streams with varying message formats, and email metadata paired with unstructured message content.
Each format requires a different discovery approach. Structured data benefits from database scanning. Unstructured data requires content inspection, natural language processing, or computer vision to identify what it contains. Semi-structured data often requires custom parsing before it can be classified.
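As one illustration of the custom parsing that semi-structured data can require, the sketch below walks newline-delimited JSON records with inconsistent shapes and reports field paths whose names hint at sensitive content. The function names and the `SENSITIVE_HINTS` list are hypothetical, and matching on field names is only a rough first pass, not a substitute for inspecting the values themselves.

```python
import json

# Hypothetical hint list: substrings in a field name that suggest sensitive content.
SENSITIVE_HINTS = ("email", "ssn", "phone", "dob")


def key_paths(value, prefix=""):
    """Return every dotted key path found in a nested JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(child, path)
    elif isinstance(value, list):
        # List items share the path of the list that contains them.
        for item in value:
            paths |= key_paths(item, prefix)
    return paths


def flag_sensitive_fields(json_lines):
    """Scan newline-delimited JSON and return sorted field paths that look sensitive."""
    flagged = set()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Malformed lines are skipped here; a real scan would log them for review.
            continue
        for path in key_paths(record):
            leaf = path.rsplit(".", 1)[-1].lower()
            if any(hint in leaf for hint in SENSITIVE_HINTS):
                flagged.add(path)
    return sorted(flagged)
```

Because the records are not assumed to share a schema, the scan collects paths per record and merges them, so a field that appears in only a few messages is still reported.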
Why Dark Data Matters for Data Security
Dark data is a security liability. Organizations cannot enforce access controls on data they have not inventoried, cannot apply encryption to files they do not know exist, and cannot respond to a breach involving data that was never cataloged. Attackers operate with no such constraint: they probe wherever access controls are weakest, and dark data in abandoned cloud buckets, forgotten file shares, and unmonitored legacy systems is a low-resistance target.
Compliance obligations apply to dark data whether or not it has been inventoried. Regulations including GDPR, HIPAA, and the California Consumer Privacy Act (CCPA) govern all stored personal data, not only data an organization actively manages. An undiscovered archive containing personally identifiable information (PII) from a decade ago still triggers subject access rights, breach notification obligations, and retention limits. IDC found that nearly half of all organizational data is considered sensitive or confidential, yet only 32% of respondents had more than 75% of their sensitive data mapped and monitored.
The attack surface also expands with volume. Redundant, obsolete, or trivial (ROT) data can conceal malware or backdoors placed by threat actors who gained earlier access. Insider threats benefit disproportionately from dark data: departing employees or malicious insiders can exfiltrate records without detection, because security teams have no baseline against which to notice that the data was copied or removed.
Common Dark Data Risks and Misconceptions
"We Would Know If We Had a Problem"
Many organizations assume that because their active systems are monitored, their overall security posture is sound. Dark data breaks this assumption. An organization can have mature controls on its known data environment while carrying years of unmonitored sensitive records in decommissioned systems or unindexed cloud storage.
Treating It as a Storage Problem Only
Dark data is commonly discussed in terms of storage cost, which is real but secondary. The security and compliance exposure of unmanaged sensitive data is the more pressing concern. An organization that deletes dark data purely to save money, without first assessing how sensitive it is, may destroy records that have compliance value. An organization that focuses only on cost misses the liability dimension entirely.
Assuming Encryption Solves It
Encrypting data at rest addresses one attack vector, but it does not make dark data visible or manageable. Security teams still cannot classify it, cannot apply access controls, and cannot respond to a breach if they do not know the data exists. Encryption without discovery is incomplete.
Underestimating Unstructured Volume
Structured databases draw attention because they are visible and queryable. The unstructured data accumulating in shared drives, collaboration tools, email archives, and endpoint file systems is far larger by volume and far harder to govern. Teams that focus discovery efforts only on databases routinely miss the majority of their dark data.
How to Manage Dark Data
Effective dark data management follows a sequence: find it, classify it, act on it, and prevent future accumulation.
Step 1: Run Discovery Across the Full Data Estate
Start with a discovery scan that covers all data repositories: cloud storage, SaaS applications, on-premises databases, file servers, email systems, and endpoint devices. Do not limit discovery to known or approved locations. Orphaned cloud buckets, personal devices used for work, and legacy systems that IT no longer actively manages are all common dark data sources. The goal is a complete inventory, including data the organization did not realize it had.
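For the file-server part of such a scan, a minimal sketch might walk a directory tree and list files that have not been modified within a chosen window. This assumes that modification time is an acceptable proxy for "unused", which is only a heuristic. The function name `find_stale_files` and its parameters are hypothetical, and cloud buckets, SaaS applications, and email systems would each need their own connectors.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_stale_files(root, max_age_days, now=None):
    """Return sorted (path, age_in_days) pairs for files under root whose
    last modification is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff_seconds = max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                modified = path.stat().st_mtime
            except OSError:
                # Unreadable or vanished files are skipped in this sketch;
                # a real inventory should record them as gaps.
                continue
            age_seconds = now - modified
            if age_seconds > cutoff_seconds:
                stale.append((str(path), int(age_seconds // SECONDS_PER_DAY)))
    return sorted(stale)
```

A call such as `find_stale_files("/srv/shares", 365)` (the path is illustrative) would return candidates for the classification step rather than a deletion list, since age alone says nothing about sensitivity.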
Step 2: Classify What You Find
Apply sensitivity labels to discovered data using content-based inspection, contextual signals (who created the file, what application generated it, where it is stored), and user review for high-stakes edge cases. Classification determines which dark data carries the highest risk: records containing PII, protected health information (PHI), financial data, or intellectual property require immediate governance. Data with no identified sensitivity or business value is a candidate for deletion.
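Content-based inspection can be sketched with a few pattern checks. The example below is a simplified illustration, not a production classifier: the label names are hypothetical, the patterns cover only email addresses, US-style Social Security numbers, and payment card numbers, and the card check uses the standard Luhn checksum to reduce false positives from arbitrary digit runs.

```python
import re

# Simplified patterns; real classifiers use broader rules plus context.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_candidate": re.compile(r"\b(?:\d[ -]?){13,19}\b"),
}


def luhn_valid(number):
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def classify_text(text):
    """Return a set of hypothetical sensitivity labels found in text."""
    labels = set()
    if PATTERNS["email"].search(text):
        labels.add("PII:email")
    if PATTERNS["us_ssn"].search(text):
        labels.add("PII:us_ssn")
    for match in PATTERNS["card_candidate"].finditer(text):
        if luhn_valid(match.group()):
            labels.add("financial:card_number")
            break
    return labels
```

An empty result does not prove a file is safe to delete. It only means none of these patterns matched, which is why contextual signals and user review remain part of the process.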
Step 3: Prioritize by Value and Risk
Not all dark data warrants the same response. A four-quadrant framework works well here:
- High value, high risk: Activate with strict governance controls
- High value, low risk: Activate and make accessible to appropriate teams
- Low value, high risk: Secure immediately, then delete or archive to cold storage
- Low value, low risk: Archive or delete to reduce storage footprint and attack surface
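The quadrants above can be expressed as a small lookup table. This is a sketch only: the names `Level`, `recommend_action`, and `triage` are hypothetical, and reviewing high-risk datasets first is an assumption made here for illustration rather than part of the framework.

```python
from enum import Enum


class Level(Enum):
    LOW = "low"
    HIGH = "high"


# Keys are (business value, risk); the actions restate the four quadrants.
QUADRANT_ACTIONS = {
    (Level.HIGH, Level.HIGH): "activate with strict governance controls",
    (Level.HIGH, Level.LOW): "activate and make accessible to appropriate teams",
    (Level.LOW, Level.HIGH): "secure immediately, then delete or archive to cold storage",
    (Level.LOW, Level.LOW): "archive or delete",
}


def recommend_action(value, risk):
    """Look up the response for a dataset's value and risk levels."""
    return QUADRANT_ACTIONS[(value, risk)]


def triage(datasets):
    """Order (name, value, risk) tuples so high-risk datasets come first,
    then alphabetically by name. The ordering rule is an assumption."""
    return sorted(datasets, key=lambda d: (d[2] is not Level.HIGH, d[0]))
```

Keeping the mapping in one table makes the policy easy to audit and change without touching the code that applies it.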
Step 4: Enforce Retention and Deletion Policies
Set automated retention schedules so data does not persist past its compliance-required or business-useful life. ROT data should be deleted on a defined schedule rather than left to accumulate. Without enforced deletion, any discovery program will simply restart from a growing baseline.
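A retention schedule can be sketched as a table of categories and maximum ages, checked against each record's creation date. The categories and periods below are placeholders chosen for illustration, not legal or regulatory guidance, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Placeholder retention periods in days; real values come from legal
# and compliance requirements, not from this sketch.
RETENTION_DAYS = {
    "financial_record": 7 * 365,
    "support_ticket": 2 * 365,
    "rot": 90,
}


def is_past_retention(category, created, today):
    """Return True if a record of this category has outlived its retention period."""
    limit = RETENTION_DAYS.get(category)
    if limit is None:
        # Unknown categories are never auto-deleted; they need human review.
        return False
    return today - created > timedelta(days=limit)


def deletion_candidates(records, today):
    """Return names of (name, category, created) records past retention."""
    return [
        name
        for name, category, created in records
        if is_past_retention(category, created, today)
    ]
```

Treating uncategorized records as "not expired" is a deliberately conservative choice: it avoids destroying data whose compliance value has not been assessed, at the cost of leaving it for manual classification.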
Step 5: Prevent Future Dark Data
Discovery and deletion are not one-time exercises. Implement continuous monitoring so new data is classified as it arrives rather than after it has been sitting for years. Establish data governance policies that define how data is labeled, who owns it, and how long it is retained. Connect classification labels to data loss prevention (DLP) policies and access controls so that sensitive data receives protections from the moment it is created.
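One way to picture the link between classification labels and DLP policy is a table that maps each label to the destinations it may move to. The labels, destination names, and `is_transfer_allowed` function below are all hypothetical and do not describe any particular product. The one design choice worth noting is that unlabeled data falls back to the most restrictive policy until it has been classified.

```python
# Hypothetical policy table: sensitivity label -> destinations allowed.
POLICY = {
    "public": {"internal_share", "external_email", "personal_cloud"},
    "internal": {"internal_share"},
    "restricted": set(),
}


def is_transfer_allowed(label, destination):
    """Return True if data with this label may move to destination.

    Unknown or missing labels are treated as "restricted", so data that
    has not been classified yet is protected by default.
    """
    allowed = POLICY.get(label, POLICY["restricted"])
    return destination in allowed
```

Defaulting to the restrictive policy means new data is protected from the moment it appears, and classification loosens controls rather than tightening them after the fact.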
Explore "A Practical Guide to Modern DSPM" for how continuous data understanding replaces point-in-time scans in AI-era security programs.
How Cyberhaven Addresses Dark Data
Cyberhaven's approach to dark data starts with Data Lineage, a proprietary technology that tracks data from creation through every copy, transformation, and movement across cloud, SaaS, on-premises, and endpoint environments. Most dark data is not data that was born invisible: it is data that became invisible through movement. A document created in a sanctioned system ends up copied to a personal folder, forwarded externally, or archived in a forgotten location. Lineage traces that movement, surfacing data that static discovery scans miss.
Cyberhaven DSPM uses this lineage foundation to continuously discover and classify sensitive data across the enterprise. When dark data surfaces, DSPM adds lineage context: security teams understand not just that unmonitored sensitive records exist, but which systems they passed through and which users touched them. This context shortens investigation time and improves remediation accuracy. For compliance, Cyberhaven DSPM generates audit-ready reporting tied to specific data assets and access histories, supporting GDPR, HIPAA, PCI DSS, and CCPA requirements.
Cyberhaven DLP capabilities extend protection to data in motion, applying real-time policies that prevent classified data from moving to unapproved destinations, including data freshly surfaced from previously dark locations.
"Data Lineage: Next-Gen Data Security Guide" walks through how tracking data origin, movement, and transformation closes the visibility gaps that create dark data risk.
Frequently Asked Questions
What is dark data?
Dark data is information that organizations collect and store but never use for analytics, decision-making, or other business purposes. Gartner coined the term to describe data retained for compliance that creates more cost and risk than value. Dark data spans structured, unstructured, and semi-structured formats, often representing more than half of total enterprise data. Its defining characteristic is invisibility: it is not monitored, classified, or governed.
What are examples of dark data?
Common dark data examples include archived server log files never reviewed, customer service call recordings stored but not analyzed, email correspondence in unprocessed archives, transaction records from discontinued product lines in legacy ERP systems, IoT sensor data collected but not integrated into operations, and scanned documents in shared drives with no metadata. Any data collected and stored without an active analytical use case qualifies.
What is the difference between dark data and shadow IT?
Shadow IT refers to unapproved tools and systems. Dark data is the data that accumulates in any system, approved or not. An organization can have substantial dark data entirely within its sanctioned, IT-managed infrastructure. Shadow IT is a governance problem at the tool level; dark data is a governance problem at the data level. The distinction matters because dark data requires discovery and classification regardless of which system holds it.
Why is dark data a security risk?
Dark data is a security risk because security controls depend on data being known, inventoried, and classified. Unmonitored dark data often falls outside access controls, encryption, and DLP policy coverage, making it a target for external attackers and insiders. It also creates compliance exposure: GDPR, HIPAA, and similar regulations apply to all stored personal data, including data the organization does not know it has. A breach involving undiscovered sensitive data may be impossible to detect or report accurately.
How is dark data related to data security posture management (DSPM)?
DSPM directly addresses the dark data problem. DSPM platforms continuously discover and classify data across cloud, SaaS, and on-premises environments, surfacing sensitive data in unmonitored locations. The "posture" in DSPM includes visibility into data the organization did not previously know it had, which is exactly the blind spot dark data creates. Without DSPM or equivalent discovery capabilities, organizations cannot identify dark data, assess its sensitivity, or remediate the security and compliance risks it creates.
What regulations apply to dark data?
Data protection regulations apply to all stored personal or regulated data, whether or not the organization actively manages it. GDPR requires organizations to know where personal data is stored and fulfill data subject rights including access and deletion requests; undiscovered dark data makes complete, accurate responses difficult or impossible. HIPAA requires safeguards for protected health information regardless of active use. CCPA creates parallel obligations for California residents' personal data. Retention schedules, breach notification, and data minimization requirements all extend to data that was stored and forgotten.




