Scaling and preserving data is a cost and management challenge. See how Hadoop® S3A and on-premises object storage allow you to massively scale Hadoop with extreme data durability and low storage overhead. 

The old paradigm that data becomes less valuable over time is shifting. Most data has a hidden value still to be mined, transformed and analyzed. With data growing at an unprecedented rate, organizations need to find ways to scale IT to store data longer – or forever. This is especially pertinent for Big Data initiatives using Hadoop or other analytics platforms.

Hadoop on HDFS and Object Storage

HDFS protects data against disk or node failures by storing multiple (typically three) full copies of the data on different disks. While this provides data locality at multiple nodes, the multicopy protection strategy consumes roughly three times the raw disk capacity of the data being stored, which makes HDFS expensive to use at larger scales.

In contrast, object storage is typically the best solution for massive data as it provides a superior method to protect data using erasure coding and scale to petabytes at much lower costs. Over half of companies we surveyed already use object storage for analytics today and 95% plan to use object storage for their analytics in the next 18 months[1].

Traditionally, Hadoop has been run on HDFS in order to manage distributed data storage and provide data locality for processing jobs. However, the move of Hadoop processing to dedicated compute clusters enables new possibilities for scaling, such as using Hadoop S3A and object storage.

Hadoop S3A

S3A is an open source connector for Hadoop, based on the official Amazon Web Services (AWS™) SDK. It was created to address the storage scaling and cost problems that many Hadoop users were having with HDFS.

Hadoop S3A allows you to connect your Hadoop cluster to any S3-compatible object store, whether in the public cloud, a hybrid cloud or on-premises. This lets you create a second tier of storage for offloading data, where the cost per terabyte is much lower.
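As a rough illustration of what connecting a cluster to an S3-compatible store involves, the sketch below generates the S3A entries that would typically go into core-site.xml. The property names are standard Hadoop S3A settings, but the endpoint host and the credentials are placeholders invented for this example, and whether path-style access is needed (or available) depends on your object store and Hadoop version.

```python
import xml.etree.ElementTree as ET


def s3a_properties(endpoint, access_key, secret_key, use_ssl=True):
    """Return S3A settings as a dict of Hadoop property names to values."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Assumption: the on-premises store expects path-style bucket addressing.
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": "true" if use_ssl else "false",
    }


def to_core_site_xml(props):
    """Render the properties in the <configuration>/<property> layout Hadoop uses."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only; substitute your own endpoint and keys.
props = s3a_properties(
    endpoint="s3.example.internal",
    access_key="EXAMPLE_ACCESS_KEY",
    secret_key="EXAMPLE_SECRET_KEY",
    use_ssl=False,
)
xml_text = to_core_site_xml(props)
print(xml_text)
```

In practice, credentials are better kept out of plain-text configuration files; the values here only show where each setting fits.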

The goal of S3A is not to replace HDFS, though you certainly could if you wanted to. The goal is to provide a second tier of storage that is optimized for cost and capacity. You don’t have to throw away valuable data just to free up disk space.

ActiveScale™ Object Storage for Hadoop

Western Digital’s ActiveScale system is architected to scale elastically and cost-efficiently, storing billions of objects with petabyte capacity in a single global namespace. The system’s erasure-coded volumes offer more than double the usable capacity of most HDFS volumes that use triple replication, and deliver 19 nines of data durability (that’s a whole lot more nines than the public cloud!).

Because the erasure code has a storage overhead of 1.6x, rather than the 3x of HDFS replication, less data has to be written for each object, so the write path is also very fast. And because of its linearly scalable performance and massively parallel access capability, the system delivers the performance that large Hadoop analytics workflows require.
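To make the overhead figures concrete, here is a small back-of-the-envelope calculation. It uses only the 3x and 1.6x overheads quoted above; the 1,000 TB dataset size is an arbitrary assumption chosen for the example, and real deployments will also reserve capacity for other purposes.

```python
REPLICATION_OVERHEAD = 3.0  # three full copies, as with typical HDFS replication
ERASURE_OVERHEAD = 1.6      # erasure-code overhead quoted in the text


def raw_capacity_needed(usable_tb, overhead):
    """Raw disk capacity (TB) required to store usable_tb of data."""
    return usable_tb * overhead


def usable_fraction(overhead):
    """Share of raw capacity that ends up holding user data."""
    return 1.0 / overhead


usable_tb = 1000.0  # assumed dataset size for illustration (about 1 PB)

raw_replicated = raw_capacity_needed(usable_tb, REPLICATION_OVERHEAD)
raw_erasure = raw_capacity_needed(usable_tb, ERASURE_OVERHEAD)

print(f"Replication: {raw_replicated:.0f} TB raw, "
      f"{usable_fraction(REPLICATION_OVERHEAD):.1%} usable")
print(f"Erasure coding: {raw_erasure:.0f} TB raw, "
      f"{usable_fraction(ERASURE_OVERHEAD):.1%} usable")
print(f"Raw capacity saved: {raw_replicated - raw_erasure:.0f} TB")
```

Under these assumptions, the same dataset needs a little over half the raw disk capacity when protected with the erasure code instead of triple replication.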

Hadoop S3A Configuration with ActiveScale Object Storage System

The ActiveScale system is engineered to overcome some of the limitations of HDFS, and to seamlessly integrate into the Hadoop ecosystem through the S3A connector.

From the user’s perspective, it looks just like any other directory in HDFS. You can copy files to it, run MapReduce jobs on it, or store Apache Hive™ tables there. Hadoop analytics jobs can read objects presented as files directly from ActiveScale buckets, and commit results back to ActiveScale where the data is immediately protected with extreme durability.
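The sketch below shows how such a bucket might be addressed from the standard Hadoop command line. It only builds and prints the commands rather than running them; the bucket name and paths are hypothetical, while `hadoop fs -ls`, `hadoop fs -cp` and `hadoop distcp` are the usual Hadoop tools for listing and copying data.

```python
import shlex


def s3a_uri(bucket, key=""):
    """Build an s3a:// URI for a bucket and an optional object key or prefix."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def list_command(path):
    """Command to list a path, whether it lives in HDFS or in an S3A bucket."""
    return ["hadoop", "fs", "-ls", path]


def copy_command(src, dst):
    """Command for a simple copy between file systems."""
    return ["hadoop", "fs", "-cp", src, dst]


def distcp_command(src, dst):
    """Command for a parallel, MapReduce-based copy of larger datasets."""
    return ["hadoop", "distcp", src, dst]


def render(cmd):
    """Format a command list as a shell-safe string for display."""
    return " ".join(shlex.quote(part) for part in cmd)


# Hypothetical bucket and paths, for illustration only.
archive = s3a_uri("analytics-archive", "/logs/2018")

commands = [
    list_command(s3a_uri("analytics-archive")),
    copy_command("/user/hadoop/logs/2018/part-00000", archive + "/part-00000"),
    distcp_command("hdfs:///user/hadoop/logs/2018", archive),
]

for cmd in commands:
    print(render(cmd))
```

Because the object store is addressed by URI, the same commands work whether the source or destination is HDFS or a bucket.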

In the video demonstration below, Mike McWhorter walks you through how to expand the storage capacity of your Hadoop cluster using S3A and the ActiveScale object storage system. His demonstration is based on a four-node Hadoop cluster running Cloudera CDH 5.11 and the ActiveScale P100 system.

Build a Better Data Lake

The economics of data are changing. Companies are gaining valuable insights from massive amounts of data over longer periods of time. Infrastructure needs to be an enabler for this new “data forever” paradigm, and traditional IT infrastructure can’t compete.

Western Digital’s ActiveScale presents a better massive scale storage solution for pairing with Hadoop because it is designed as a scalable, erasure coded object storage solution, with extreme data durability and low storage overhead. No other storage company can match Western Digital’s unique component-to-system vertical innovation and integration to disrupt the economics of data, and help your data thrive.

HDFS Tiered Storage at DataWorks Summit

Our team will be speaking at the upcoming DataWorks Summit in San Jose.

Join us for a technical session on HDFS Tiered Storage on Tuesday, June 19, 11:50 AM – 12:30 PM, in Executive Ballroom 210A/E.

This session will cover the PROVIDED storage tier, which mounts external storage systems in the HDFS NameNode. Our experts will show how a Hadoop admin can manage storage tiering between clusters, and how HDFS handles it internally through its snapshot mechanism and by asynchronously satisfying the storage policy.


Learn About Object Storage in Our Upcoming Webinar

Want to understand more about on-premises object storage? What problem does it solve? What makes it a better storage solution than SAN or NAS, and for which use cases? Join us July 10th, or watch on demand afterward.


[1] The Growing Role of Object Storage in Solving Unstructured Data Challenges, HGST and 451 Research, November 2017

