Who Generates The Most Big Data?

What human activity will generate the most data in the future? Media? Financial analytics?

The answer, it turns out, could be in our collective DNA.

Where Big Data is Getting Much Bigger

Human genomics is poised to become one of the largest, if not the largest, sources of new data over the next decade, said Sumit Sadana, Executive Vice President, Chief Strategy Officer and General Manager of Enterprise Solutions at SanDisk® during a keynote speech at Nexenta’s OpenSDx Summit in San Francisco earlier this month.

As a result, genomics will play an instrumental role in one of IT’s biggest issues: the evolution of the data center to fully capitalize on the opportunities created by Big Data.

“Human genomics data is doubling every seven months,” he noted. “If you’re in Big Data, you need to see the doctor.”

The prediction comes from a recent paper by a research team led by Professor Gene Robinson at the University of Illinois that examined the data generation and storage needs of four fields: astronomy, YouTube, Twitter and human genomics. The researchers found that the largest astronomy labs, YouTube and the 20 largest human genomics labs currently require about 100 petabytes of additional storage per year. (All those tweets, meanwhile, pale in comparison: Twitter adds only about half a petabyte of storage a year.)
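
To get a feel for how quickly a seven-month doubling period compounds, here is a minimal sketch. The doubling period is the figure quoted above; the 25 PB/year starting rate is a hypothetical placeholder, not a number from the article.

```python
# Illustrative only: compound a "doubling every seven months" growth rate forward.
# The 7-month doubling period comes from the talk; the 25 PB/year starting point
# is a made-up placeholder, not a figure from the article.

DOUBLING_PERIOD_MONTHS = 7
baseline_pb_per_year = 25.0  # hypothetical starting rate of new genomic data

for years in (1, 2, 5, 10):
    months = years * 12
    growth_factor = 2 ** (months / DOUBLING_PERIOD_MONTHS)
    projected = baseline_pb_per_year * growth_factor
    print(f"after {years:>2} years: x{growth_factor:>9,.0f}  "
          f"~{projected:,.0f} PB/year (~{projected / 1000:,.1f} EB/year)")
```

Raw compounding at that rate quickly outruns even the 2025 projections below, which is why the study’s estimates assume that compression and other advances will temper the growth.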

Challenges of the next decade

Now flash forward to 2025. By then, the data generated by the Square Kilometre Array will require an exabyte of storage a year, a 10x increase. YouTube will likely be adding approximately 2 exabytes a year to keep up with consumer uploads, a 20x increase.

The storage demands for human genomic data, however, could reach 40 exabytes or more, even including anticipated advances in compression. The cost of sequencing a single human genome has plunged from over $100,000 a decade ago to under $1,000, making it economically and technically feasible to sequence millions, if not billions, of individuals. Some hospitals hope to use this data to develop personalized courses of treatment for cancer patients.
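
A rough back-of-the-envelope calculation shows why sequencing at that scale lands in exabyte territory. The roughly 100 GB of raw data per whole genome used below is a commonly cited ballpark, not a figure from the article, so treat the output as order-of-magnitude only.

```python
# Order-of-magnitude sketch: how much raw data would N sequenced genomes produce?
# The ~100 GB per whole genome is an assumed ballpark, not a figure from the article.

GB_PER_GENOME = 100             # assumed raw data per whole genome
GB_PER_EXABYTE = 1_000_000_000  # 10^9 GB in an exabyte (decimal units)

for genomes in (1_000_000, 100_000_000, 1_000_000_000):
    exabytes = genomes * GB_PER_GENOME / GB_PER_EXABYTE
    print(f"{genomes:>13,} genomes -> ~{exabytes:,.1f} EB of raw data")
```

At hundreds of millions of genomes, raw data alone reaches tens of exabytes, the same ballpark as the 40-exabyte estimate above.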

The Illinois researchers also predict that the number of plant, animal and microbe species sequenced will rise from around 37,000 today to 2.5 million over the same period. China’s genomics powerhouse BGI has already sequenced 3,000 varieties of rice. (Twitter, meanwhile, will only be topping 17 petabytes.)

These possibilities, of course, will only become real if infrastructure is developed and deployed that can keep up with the new challenges. If aircraft design had stopped with the biplane, flight would never have taken off.

The future is fairly clear

Big Data will be no different, and we’re already seeing the results. Three years ago, flash was deployed in data centers primarily to accelerate particular applications. Now it is being deployed for primary storage, Sadana noted, because both the total cost of operation (TCO) and the total cost of acquisition (TCA, the upfront cost of hardware, software and other necessary capital items) are now lower, since flash optimizes the use of servers and other assets.
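
As a minimal sketch of the distinction, the functions below treat TCA as the upfront capital spend and TCO as that spend plus operating costs over the system’s service life. The breakdown of operating costs into power, cooling and administration is an assumption for illustration, not a formula from the article.

```python
# Minimal sketch of the TCA/TCO distinction discussed above.

def total_acquisition_cost(hardware, software, other_capital=0.0):
    """TCA: upfront capital cost of hardware, software and other capital items."""
    return hardware + software + other_capital

def total_cost_of_ownership(tca, annual_power_and_cooling, annual_admin,
                            service_life_years):
    """TCO: acquisition cost plus assumed operating costs over the service life."""
    return tca + (annual_power_and_cooling + annual_admin) * service_life_years
```

The article’s point is that flash-based systems can now come out ahead on both measures at once, not just on the operating side.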

Take the infrastructure required to run a 50 terabyte database. A storage system with RAID 6 redundancy would require 288 300GB HDDs slotted into 12 enclosures, for an effective capacity of 51.6TB. (While capacity on a 15K drive might ordinarily be left fallow because of slow access times, a RAID 6 configuration lets you use it.) Such a configuration would provide 51,000 IOPS.

By contrast, one could use 24 3.84TB solid state drives to build a system with 55TB of effective capacity. It would fit into four enclosures (technically it could fit into fewer, but four is more convenient).

The solid state system would deliver a TCO that is 27% lower and a TCA that is 23% lower. It would also consume far less power: 1,250 watts, including cooling, versus 8,800 watts. Most importantly, it would deliver one million IOPS. Better performance at a lower price.
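
Putting the figures from this comparison side by side, the sketch below recomputes the raw capacities and the performance and power ratios. Drive counts, capacities, IOPS and wattages are the numbers quoted above; no pricing is included because the article only gives relative TCO/TCA savings.

```python
# Side-by-side summary of the 50TB-database example using the figures quoted above.

configs = {
    "HDD (RAID 6)": {"drives": 288, "drive_tb": 0.3, "enclosures": 12,
                     "effective_tb": 51.6, "iops": 51_000, "watts": 8_800},
    "SSD":          {"drives": 24, "drive_tb": 3.84, "enclosures": 4,
                     "effective_tb": 55.0, "iops": 1_000_000, "watts": 1_250},
}

for name, c in configs.items():
    raw_tb = c["drives"] * c["drive_tb"]
    print(f"{name:>12}: {c['drives']:>3} drives ({raw_tb:.1f}TB raw), "
          f"{c['enclosures']:>2} enclosures, {c['effective_tb']}TB effective, "
          f"{c['iops']:,} IOPS, {c['watts']:,}W including cooling")

hdd, ssd = configs["HDD (RAID 6)"], configs["SSD"]
print(f"IOPS advantage for flash:  ~{ssd['iops'] / hdd['iops']:.0f}x")
print(f"Power advantage for flash: ~{hdd['watts'] / ssd['watts']:.1f}x")
```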

The “dollars per GB” argument is going away too. Nexenta and SanDisk have created a solution that integrates NexentaStor onto our InfiniFlash all-flash array for an effective price of $1.50 per GB raw.

“At these prices disk may not be dead, but it’s heading for the mortician’s parlor,” wrote Chris Mellor of The Register.

The argument, and the future, is fairly clear.
