The Third Phase of Big Data
This post was originally published on Re/code.
Back in the pre-Internet era of the ’70s and ’80s, computer enthusiasts and researchers used “sneakernet” (loading information onto a portable storage device and walking it over) as a way to move large files.
With the advent of Big Data, it’s making a comeback.
“If you speak to any company [in life sciences], it is a major issue,” said Talli Somekh, CEO of AppSoma, a startup developing a data science platform initially targeted at computational biology. “It’s a matter of physics … some of the data sets are so large you just can’t move them to the cloud.”
In fact, some of the largest and most technologically advanced companies in the world are increasingly shuttling entire storage systems from one place to another just to be able to share data.
The Big Data Revolution
The demand for higher-capacity, higher-performing systems is driving us toward the third phase of the Big Data revolution. In the first phase, we saw the advent of software technologies like Hadoop and NoSQL for handling extremely large amounts of data. (This phase, of course, is by no means over.)
The second phase began with the proliferation of reliable, economical sensors and other devices for harvesting real-world data. Software apps for mining video streams, images, handwritten forms and other “dark” data belong in this category too: Without them, the data, from a practical perspective, wouldn’t exist.
The third phase will focus on infrastructure. Simply put, we need new hardware, software, networking and data centers designed to manage the staggering amounts of data being generated and analyzed by the first two innovations. Hyperscale data centers, software-defined networking and new storage technologies represent the first steps in what will be a tremendous cycle of innovation.
Why New Ideas Need New Infrastructure
Historically, new ideas have invariably needed new infrastructure. Automobiles fundamentally changed life. But they also required roads, gas stations and freeways. Pavement, invented decades earlier, suddenly was in demand. Light bulbs turned night into day, and spiraling demand for light led to investments and innovations for grids that ultimately covered nations.
The potential benefits of Big Data come paired with a need for significant breakthroughs in the storage and networking infrastructure that supports it. Think of security cameras. Airport security managers are beginning to discuss the possibility of upgrading them to UltraHD or 4K. With that kind of resolution, accurate, detailed and searchable streams would replace grainy photos and reduce security risks.
With appropriate controls for privacy and anonymity, you could also employ 4K cameras to analyze shopper behavior or pedestrian traffic patterns. 4K is not just for big TVs at CES.
4K, however, also requires an incredible backbone. A single minute of 4K video requires about 5.3 gigabytes. At 4K, the 7,000 CCTV cameras in London would generate 52 petabytes a day, or several Libraries of Congress’ worth of information.
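The daily figure can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 petabyte = 1,000,000 gigabytes) and that every camera records around the clock; the function and constant names are illustrative only. Under those assumptions the total comes to roughly 53 petabytes a day, in the same range as the estimate above.

```python
GB_PER_MINUTE_4K = 5.3     # per-camera rate quoted in the article
CAMERAS = 7_000            # London CCTV count quoted in the article
MINUTES_PER_DAY = 24 * 60
GB_PER_PB = 1_000_000      # assumption: decimal (SI) units


def daily_petabytes(cameras: int = CAMERAS,
                    gb_per_minute: float = GB_PER_MINUTE_4K) -> float:
    """Storage generated per day, in petabytes, assuming continuous recording."""
    total_gb = cameras * gb_per_minute * MINUTES_PER_DAY
    return total_gb / GB_PER_PB


if __name__ == "__main__":
    print(f"{daily_petabytes():.1f} PB per day")
```

Changing the camera count or the per-minute rate shows how quickly the total moves: doubling either one doubles the daily storage bill.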
Similarly, teams of physicists developed a distribution system that closely knits flash storage and networking technology in order to provide access to approximately 170 petabyte datasets from the Large Hadron Collider at the CERN Laboratory in Geneva, Switzerland, for research centers around the world. These computing systems have the ability to transfer data from disk to disk across a wide-area network at 100G speeds, which allows particle physicists to analyze data at 73GB per second. With these insights, they are discovering new particles and forces that help explain the makeup of the universe.
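For a rough sense of why moving data sets of this size is hard, the sketch below estimates how long a single, fully utilized link would need to carry 170 petabytes. It assumes that “100G” means 100 gigabits per second, that units are decimal, and that there is no protocol overhead; the function name is hypothetical and the result is an idealized lower bound, not a measurement.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(dataset_pb: float, link_gbps: float) -> float:
    """Days needed to move dataset_pb petabytes over one link_gbps link.

    Assumptions: decimal units, a single link, 100 percent sustained
    utilization and no protocol overhead.
    """
    bits = dataset_pb * 10**15 * 8           # petabytes -> bits
    seconds = bits / (link_gbps * 10**9)     # gigabits per second -> bits per second
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{transfer_days(170, 100):.0f} days")
```

Under these assumptions the answer is on the order of five months for one link, which helps explain both the need for many parallel high-speed paths and the renewed appeal of physically shipping storage.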
Life sciences might be the biggest challenge of them all. A single human genome requires around 200GB of raw storage. Sequencing a million human genomes would thus require approximately 200 petabytes. In 2014, Facebook was uploading 600 terabytes a day. At that rate, it would take Facebook, owner of one of the world’s most powerful data infrastructures, about a year to upload a million human genomes.
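The genome estimate follows directly from the numbers quoted here. The sketch below reproduces it, assuming decimal units and a constant upload rate; the function and constant names are illustrative only.

```python
GB_PER_GENOME = 200        # raw storage per genome, as quoted
GENOMES = 1_000_000
UPLOAD_TB_PER_DAY = 600    # Facebook's 2014 daily upload volume, as quoted
GB_PER_TB = 1_000          # assumption: decimal (SI) units
TB_PER_PB = 1_000


def total_petabytes(genomes: int = GENOMES,
                    gb_per_genome: float = GB_PER_GENOME) -> float:
    """Raw storage for the given number of genomes, in petabytes."""
    return genomes * gb_per_genome / GB_PER_TB / TB_PER_PB


def upload_days(total_pb: float, tb_per_day: float = UPLOAD_TB_PER_DAY) -> float:
    """Days needed to upload total_pb petabytes at a constant daily rate."""
    return total_pb * TB_PER_PB / tb_per_day


if __name__ == "__main__":
    pb = total_petabytes()
    print(f"{pb:.0f} PB, about {upload_days(pb):.0f} days to upload")
```

With these inputs the total is 200 petabytes and the upload takes roughly 333 days, which is where the “about a year” figure comes from.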
Big Data for agriculture? Wheat has more varied genetic structure than humans.
“And that’s just the raw data off the genome sequencer,” noted Somekh. Deeper analysis multiplies the computational requirements, forcing researchers to balance local and cloud processing. (Forget just getting rid of data: the FDA imposes stringent retention requirements.)
Meeting New, Big Data Challenges
Traditional storage solutions can provide robust performance, but it often comes with higher infrastructure cost and complexity. To alleviate this complexity, some turn to virtualization.
Virtualization has vastly improved the ROI and utilization of server infrastructure, but even in the most hyperefficient cloud operations like Google’s, 20 percent to 50 percent of computing cycles still go to waste because the processor can’t access data fast enough. This delay is referred to as system latency. In most of today’s enterprise data centers, system latency can be even higher, which can cost companies millions or even billions of dollars a year in business lost to slow transactions.
Hard drives, the last mechanical devices in server rooms other than air conditioners, came out in 1956, just months after the debut of the first computer that didn’t need vacuum tubes. Traditional data center technologies and architectures are simply not built for the scale and speed required for these new, big-data challenges.
Trying to meet the challenge with hard-drive technologies can also be a fiscal and environmental disaster. The NRDC has calculated that U.S. data centers consumed 91 billion kilowatt-hours in 2013, or twice the amount of electricity used by households in New York City. The figure will rise to 140 billion kilowatt-hours a year by 2020 if left unchecked.
Solid-state technology, or flash, can reduce the hardware footprint by more than 90 percent while increasing input-output performance by a factor of 20.
Big Data is really one of the magical concepts of our era. Its ability to give us greater insight and understanding of the world around us increases our ability to create a better society. But it is also going to require a tremendous amount of effort behind the scenes to create solutions that can help wield these Big Data stores in a compact, cost-effective, reliable and environmentally conscious way.