The Economics of Dark Data: Unlocking Value for AI and Machine Learning
Key takeaways
- Dark data—unused but stored enterprise information—can hold untapped value for AI and business intelligence.
- AI and machine learning have transformed dark data into an asset, enabling predictive insights, model training, and improved GenAI accuracy through RAG systems.
- HDDs’ ability to combine fast retrieval, high capacity, and low TCO makes them the foundation for nearly 80% of installed data center storage capacity and delivers optimal economics for dark data workloads.
- Storage tiering is now ubiquitous, allowing enterprises to balance performance and cost by dynamically placing workloads on flash, HDD, or tape based on IOPS and throughput requirements.
- Western Digital’s NVMe-oF solutions (RapidFlex and OpenFlex Data 24) enable disaggregated storage, giving enterprises the flexibility to match storage technology to specific workloads and future-proof their infrastructure.
“The internet is forever.”
That’s the cautionary message we often deliver to the young people in our lives. The takeaway is that once you’ve put something publicly on the internet, you no longer control it. It may live on forever and be used or amplified in ways you never imagined, so you have to be careful.
But in the world of enterprise data storage, it can be a different message—one not of caution but of economics and strategy.
Enterprises used to ask what data they could afford to store; now many realize that for much of their data, they can’t afford to delete it.1 The data may be needed for financial, regulatory, or compliance reasons, or it may need to be kept because its value has yet to be recognized.
What is dark data and why does it matter?
Enterprises today are awash in “dark data”—data they store in perpetuity but aren’t actively using. The question then arises: How can enterprises make use of this data?
Historically, there wasn’t much of an answer. Dark data was often walled off in application-specific silos, unstructured data formats, or proprietary databases. The data was often confidential and/or proprietary, and as such not suitable or economical to migrate to the cloud. The data was also often scattered, and its breadth and depth made it inscrutable to human analysis or categorization.2 It may have been appropriately and diligently stored but was mostly inert and under-utilized.
How AI and machine learning unlock dark data value
All that is changing. Machine learning algorithms have enabled enterprises to extract valuable business intelligence from this dark data. The data can train or fine-tune generative AI models, making them domain-specific and unique to particular businesses. Even companies using pre-trained large language models (LLMs) in their business processes can mine pools of confidential or proprietary dark data for use in retrieval-augmented generation (RAG) cross-referencing libraries, thereby improving the accuracy of GenAI output.3
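As a rough illustration of the retrieval step in RAG, the sketch below ranks a few internal documents against a question using simple bag-of-words cosine similarity, then builds a prompt from the best match. The sample documents, the function names, and the scoring method are all illustrative assumptions rather than anything described in the cited sources; production systems typically use learned embeddings and a vector index instead.

```python
import math
import re
from collections import Counter

# Hypothetical "dark data" snippets; real inputs would be internal documents.
DOCS = [
    "Warranty claims for pump model A rose after the 2019 firmware update.",
    "Quarterly travel expenses were reviewed by the finance team.",
    "Cafeteria menu changes take effect next month.",
]
QUESTION = "Why did warranty claims rise for pump model A?"


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, documents, top_k=1):
    """Return up to top_k documents that share vocabulary with the question."""
    q_vec = Counter(tokenize(question))
    scored = [(cosine_similarity(q_vec, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents with no overlap at all so unrelated text is never injected.
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(question, context_docs):
    """Prepend the retrieved passages to the question for a language model."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt(QUESTION, retrieve(QUESTION, DOCS)))
```

The point of the design is that the proprietary documents stay in the enterprise's own storage and are only consulted at query time, which is why the size and accessibility of the underlying data pool matter.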
Estimates vary by industry but suggest that dark data may account for anywhere from slightly under 50% to above 90% of all enterprise stored data.4 This represents a huge pool that may be leveraged for critical business insights or mined for use in machine learning or generative AI.
But this is not a trivial task. As TechRadar notes, “The challenge lies not in the accumulation of data, but in the effective extraction of actionable intelligence. Artificial Intelligence (AI) serves as the transformative tool, capable of converting this ‘dark data’ into tangible business value.” Some initial challenges facing IT professionals are how to break through siloed data, how to categorize unstructured data, and how to rebuild it into a more useful resource.
This change introduces new challenges for enterprise storage architects. As dark data is metaphorically brought into the light, it is warming up, and its workloads are changing. Enterprises need to reassess their data storage strategies to ensure that their storage pools are architected so that data mining for AI can make the most of dark data reservoirs, and that this storage is performant, economical, and reliable.
Architecting storage for dark data: HDD vs. flash
It’s easy to think of performance solely in relative terms. A usual answer to “who is the fastest person on the planet?” is Jamaican sprinter Usain Bolt, the world-record holder in the 100m and 200m dashes. But another answer might be Kenyan long-distance runner Eliud Kipchoge, the first and only person ever timed to run a marathon in under two hours. They’re both fast, albeit in different ways.
In storage, we need to think about performance the same way: flash storage could be considered the quick sprinter while HDD is the consistent endurance runner.
Dark data is a perfect example of data that has historically been stored optimally on HDD. HDD offers fast retrieval compared to other archival technologies and low TCO for online/connected data. This balance makes enterprise-grade HDD storage the optimal medium for most workloads that draw on dark data, including machine learning, data preparation for AI training and fine-tuning, and even RAG databases.
This same blend of performance and economics makes HDD suitable for most enterprise and data center workloads, from cloud storage as a service (STaaS) businesses, to the data lakes supporting AI model training. The result is that HDD is the overwhelming capacity leader in the modern data center.
Today, nearly 80% of installed worldwide data center storage capacity is HDD, with the remainder split between flash and tape.5 Western Digital continues to invest in HDD innovation to help ensure that capacities and performance keep pace with the ever-expanding storage requirements of modern storage workloads.
Storage tiering: balancing performance and TCO
But performance is only part of the picture, and one that we must balance against TCO. Flash storage can cost 6x as much as HDD storage, or more, to acquire at scale, which greatly contributes to its higher TCO.6
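To make the acquisition gap concrete, here is a minimal arithmetic sketch. Only the 6x multiplier comes from the text above; the HDD price per terabyte is a placeholder assumption, and the functions cover acquisition cost only, not the power, cooling, or maintenance components of a full TCO model.

```python
# Placeholder assumption: an illustrative HDD price, not a quoted market figure.
HDD_PRICE_PER_TB = 10.0
# From the article: flash can cost "6x or more" to acquire at scale.
FLASH_MULTIPLIER = 6.0


def acquisition_gap(capacity_tb, hdd_price_per_tb=HDD_PRICE_PER_TB,
                    flash_multiplier=FLASH_MULTIPLIER):
    """Return (hdd_cost, flash_cost, difference) for an all-HDD vs all-flash pool."""
    hdd_cost = capacity_tb * hdd_price_per_tb
    flash_cost = hdd_cost * flash_multiplier
    return hdd_cost, flash_cost, flash_cost - hdd_cost


def blended_cost(capacity_tb, flash_fraction, hdd_price_per_tb=HDD_PRICE_PER_TB,
                 flash_multiplier=FLASH_MULTIPLIER):
    """Acquisition cost of a pool where flash_fraction of capacity sits on flash."""
    if not 0.0 <= flash_fraction <= 1.0:
        raise ValueError("flash_fraction must be between 0 and 1")
    flash_tb = capacity_tb * flash_fraction
    hdd_tb = capacity_tb - flash_tb
    return hdd_tb * hdd_price_per_tb + flash_tb * hdd_price_per_tb * flash_multiplier


if __name__ == "__main__":
    hdd, flash, gap = acquisition_gap(1000)
    print(f"All HDD: {hdd:.0f}  All flash: {flash:.0f}  Gap: {gap:.0f}")
    print(f"25% flash blend: {blended_cost(1000, 0.25):.0f}")
```

Under these placeholder prices, keeping only a fraction of capacity on flash lands the pool's acquisition cost between the two extremes, which is the economic motivation for tiering.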
Based on this TCO gap, storage architects typically tier their workloads to place the most IOPS-intensive workloads on flash, and the most throughput-intensive workloads on HDD. A typical data center deployment will start with HDD as the default media platform for data storage needs, and then, as needed, augment that storage for specific use cases.
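The placement rule described above can be sketched as a simple policy: HDD is the default, and a workload moves to flash only when its random-I/O intensity crosses a threshold. The `Workload` fields, the IOPS-per-terabyte metric, and the threshold value are hypothetical choices made for illustration; real tiering engines weigh many more signals.

```python
from dataclasses import dataclass

# Hypothetical cutoff; an actual value would come from measuring the environment.
FLASH_IOPS_PER_TB_THRESHOLD = 1000.0


@dataclass
class Workload:
    name: str
    iops_per_tb: float       # random I/O intensity (assumed metric)
    throughput_mbps: float   # sequential throughput demand, kept for reporting


def place(workload, threshold=FLASH_IOPS_PER_TB_THRESHOLD):
    """Default to HDD; promote to flash only for IOPS-intensive workloads."""
    return "flash" if workload.iops_per_tb > threshold else "hdd"


def plan(workloads, threshold=FLASH_IOPS_PER_TB_THRESHOLD):
    """Map each workload name to its assigned tier."""
    return {w.name: place(w, threshold) for w in workloads}


if __name__ == "__main__":
    # Illustrative workloads with made-up numbers.
    pool = [
        Workload("transaction-db", iops_per_tb=5000.0, throughput_mbps=200.0),
        Workload("training-data-lake", iops_per_tb=50.0, throughput_mbps=2000.0),
    ]
    print(plan(pool))
```

Starting from HDD and promoting by exception mirrors the deployment pattern in the text, where flash augments the default tier for specific use cases rather than replacing it.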
In the world of modern data centers, tiering is no longer a discussion point: it’s ubiquitous. Data storage is split among the three major technologies: flash, HDDs, and magnetic tape.
At the cloud/hyperscale level, cloud service providers (CSPs) manage massive storage pools of all three technologies to ensure frictionless continuity. They offer storage services with specific service level agreements (SLAs) to their customers and can dynamically tier as needed as long as they remain within the SLA.
Most of the hyperscalers also have their own captive storage workloads and dynamically tier this data as well. These vendors, who are among the biggest storage customers in the world, understand that storage at scale requires a balance of performance and TCO.
Of course, tiered storage is not the sole province of CSPs. At all scales, tiered storage is a way to balance performance and TCO. Storage system vendors offer managed hybrid storage products that seamlessly blend flash and HDD storage to hit the capacity and performance windows that their customers require. This allows customers who are keeping data on premises to store their dark and formerly dark data economically while ensuring that their hottest data is available and performant when they need it.
Western Digital’s solutions for dark data management
Recognizing that one size fits most, but not all, Western Digital, through our Data Center Platforms division, is driving the move to disaggregated storage via NVMe™-over-Fabrics (NVMe-oF). Through the RapidFlex family of NVMe-oF initiators and targets, and through the OpenFlex Data 24 NVMe-oF Storage Platform, we’re enabling enterprises to match the right storage technology to the right workload.
Whether data is retained in the cloud or on premises, storage tiering is now a common and expected capability. This means that whatever the new workload requirements imposed upon dark data as it is put to use, enterprises large and small will have storage options that fit. While most of that data will likely remain on HDD,7 tiered storage architectures will provide flexibility where different needs arise.
Future-proofing dark data with storage tiering
Over the past several decades, the capacity gains and cost efficiency of modern storage have moved enterprises into the “keep everything” mindset with their data. However, for many of them, their data lived in the cold and dark and was only brought out when companies knew there was a solid business case for its use.
With the growth of machine learning and the stunning emergence of generative AI, however, this dark data has been brought into the light, becoming a critical resource that enterprises can mine for valuable business intelligence and for more accurate and relevant AI output.
This transformation of dark data is inspiring enterprises to examine their storage hierarchy to make sure all of their data resides somewhere where it is performant, cost-effective, and safe.
Storage tiering helps ensure that the demands of all workloads can be met, but the economics are clear. Whether the data is in the cloud or on-prem, HDDs will continue to unlock value while delivering the right balance of performance and TCO for nearly all dark and formerly dark data.
- https://spacetime.eu/blog/your-business-doesn-t-need-to-choose-what-data-to-keep-and-what-to-delete
- https://medium.com/@v2solutions/dark-data-the-hidden-cost-of-unused-enterprise-data-and-how-to-monetize-it-a0b55bf5eece; https://www.feld-m.de/en/blog/dark-data-cost-poor-data-quality/
- https://www.intalio.com/blogs/unlocking-dark-data-how-machine-learning-classifies-and-governs-unstructured-enterprise-content; https://www.cio.com/article/404526/unlocking-the-hidden-value-of-dark-data.html; https://centific.com/blog/your-dark-data-is-valuable-if-you-know-how-to-unlock-it; https://laiyertech.ai/blog/index.php/2025/09/04/rag-how-to-use-proprietary-documents-safe-and-secure-with-ai/
- https://www.cio.com/article/404526/unlocking-the-hidden-value-of-dark-data.html
- Source: IDC, Worldwide IDC Hard Disk Drive Forecast, 2025–2029, doc #US53465525, June 2025 and IDC, Worldwide IDC Solid State Drive Forecast Update, 2025–2029, doc #US52455725, June 2025
- Source: IDC, Worldwide IDC Hard Disk Drive Forecast, 2025–2029, doc #US53465525, June 2025
- Source: IDC, Worldwide IDC Hard Disk Drive Forecast, 2025–2029, doc #US53465525, June 2025
