Better Data Lake with Apache Hadoop® on Object Storage

April 10, 2018

In the hyperscale world, data sets can get extremely large, frequently approaching the petabyte range. For companies that deal in big data, finding enough disk space to store everything can be a major challenge. To complicate matters, scale-out file systems such as HDFS rely on replication to protect your data. When you’re working at the petabyte scale, making multiple copies of your data set is not an efficient way to manage disk space. When these file systems inevitably fill up, users are forced to make difficult decisions about their data, often deleting valuable information just to free up disk space.

In the past, if you wanted to add storage to Hadoop®, your only option was to expand your cluster, purchasing additional servers and software licenses (whether you needed them or not) just to get the extra disk space. For users who just need to increase their available disk capacity, buying more servers is not an efficient way to scale.

Thankfully, there is a new option: S3A.

Hadoop on Object Storage using S3A

S3A is Hadoop’s new S3 adapter. It was created to address the storage problems that many Hadoop users were having with HDFS. S3A allows you to connect your Hadoop cluster to any S3-compatible object store, creating a second tier of storage. With S3A, you can offload your data from HDFS onto object storage, where the cost per TB is much lower. From the user’s perspective, it looks much like any other directory in HDFS. You can copy files to it. You can run MapReduce jobs on it. You can store Hive tables there. But because object stores protect data with erasure coding rather than replication, it can offer more than double the density of HDFS, with more redundancy and better scaling properties. The goal of S3A is not to replace HDFS, though you certainly could if you wanted to. The goal is to provide a second tier of storage that is optimized for cost and capacity. Now you don’t have to throw away valuable data just to free up disk space. You can archive it.
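As a rough illustration of archiving a directory from HDFS to the object tier, the Python sketch below builds a `hadoop distcp` command line that copies an HDFS path to an `s3a://` URI. The bucket name, paths, and helper function names are hypothetical, and the sketch assumes the cluster has already been configured with S3A credentials and an endpoint. It only prints the command rather than running it.

```python
import shlex
from typing import List


def s3a_uri(bucket: str, key_prefix: str = "") -> str:
    """Build an s3a:// URI for a bucket and an optional key prefix."""
    prefix = key_prefix.strip("/")
    return f"s3a://{bucket}/{prefix}" if prefix else f"s3a://{bucket}/"


def build_archive_command(hdfs_path: str, bucket: str, key_prefix: str = "") -> List[str]:
    """Return the argument list for copying an HDFS path to object storage.

    `hadoop distcp <source> <destination>` is Hadoop's distributed copy tool;
    the source and destination values used here are placeholders.
    """
    return ["hadoop", "distcp", hdfs_path, s3a_uri(bucket, key_prefix)]


# Hypothetical example: archive last year's logs to a bucket named "archive-bucket".
cmd = build_archive_command("hdfs:///data/logs/2017", "archive-bucket", "logs/2017")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Once the copy has been verified, the original HDFS directory can be removed to reclaim space, which is the archive-instead-of-delete workflow described above.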

Why This is Better than Traditional Scale-Out

This approach offers several advantages over traditional scale-out. The first advantage is density. Object stores use erasure coding to protect your data. If you’re unfamiliar with erasure coding, you can think of it as the next generation of RAID. In an erasure-coded volume, files are divided into shards, with each shard placed on a different disk. Additional shards containing error-correction information are added to protect against disk failures. Because these extra shards are only a fraction of the size of the data, erasure-coded volumes can offer more than double the usable capacity of volumes that use triple replication.
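The density claim can be checked with simple arithmetic. The Python sketch below compares the usable fraction of raw disk under triple replication with that of an erasure-coded layout. The 10 data + 4 parity layout is an assumption chosen purely for illustration; it is not a statement about how any particular object store is configured.

```python
from fractions import Fraction


def replication_efficiency(copies: int) -> Fraction:
    """Usable fraction of raw capacity when every block is stored `copies` times."""
    return Fraction(1, copies)


def erasure_coding_efficiency(data_shards: int, parity_shards: int) -> Fraction:
    """Usable fraction of raw capacity when each object is split into
    `data_shards` shards plus `parity_shards` error-correction shards."""
    return Fraction(data_shards, data_shards + parity_shards)


triple = replication_efficiency(3)        # 1/3 of raw capacity is usable
ec = erasure_coding_efficiency(10, 4)     # assumed 10+4 layout: 5/7 is usable
ratio = ec / triple                       # how much more usable capacity EC gives

print(f"Triple replication: {float(triple):.0%} usable")
print(f"Erasure coding 10+4: {float(ec):.0%} usable")
print(f"Capacity advantage: {float(ratio):.2f}x")
```

Under these assumed parameters, the same raw disk holds roughly 2.14 times as much data as it would with triple replication, while the layout tolerates the loss of any four shards.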

Another advantage is ease of deployment. You don’t have to scale your Hadoop cluster just to add storage. In fact, you don’t have to make any modifications to your production system at all. Just plug the object store into your network and edit the core-site.xml file in your Hadoop environment. Storage can be added at any time without disrupting your day-to-day operations. It’s far less burdensome than integrating 50 new servers into your cluster.
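To give a sense of what the core-site.xml edit involves, the Python sketch below generates a fragment containing a handful of S3A properties. The property names (`fs.s3a.endpoint`, `fs.s3a.access.key`, `fs.s3a.secret.key`, `fs.s3a.path.style.access`) are standard Hadoop S3A settings, but the endpoint URL and credentials are placeholders. Enabling path-style access is an assumption that often applies to on-premises object stores; check your own store's requirements.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def s3a_properties(endpoint: str, access_key: str, secret_key: str) -> Dict[str, str]:
    """Collect a minimal set of S3A settings for an S3-compatible object store."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Assumption: the object store expects path-style bucket addressing.
        "fs.s3a.path.style.access": "true",
    }


def to_core_site_fragment(props: Dict[str, str]) -> str:
    """Render the properties in the <configuration>/<property> layout Hadoop uses."""
    configuration = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
        ET.indent(configuration)
    return ET.tostring(configuration, encoding="unicode")


# Placeholder values; substitute your own endpoint and credentials.
fragment = to_core_site_fragment(
    s3a_properties("https://objectstore.example.com", "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")
)
print(fragment)
```

The generated properties would be merged into the existing core-site.xml rather than replacing it. In production, keeping the secret key out of plain-text configuration is worth considering.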

The last, and perhaps most important, advantage is cost savings. Object stores are designed to be cheap and deep. They offer one of the lowest costs per TB of any storage approach and allow you to scale your storage independently of compute resources. Now you don’t have to buy unnecessary servers, CPUs, memory, and software licenses to get more disk space. Object storage provides you with enough capacity to store all of your data, while keeping your costs as low as possible.

More Data = More Insights

Information is a valuable commodity. The more data we collect and study, the more we can learn about users and the better informed our business decisions become. But in order to do this effectively, we need a smart storage solution that can handle the volume without putting an unnecessary strain on our IT budget.

ActiveScale™ from Western Digital is an industry-leading object storage system, designed from the ground up to handle massive scale. With advanced erasure coding techniques to maximize your usable disk space, it is the ideal choice for petabyte-scale Hadoop environments. We are in the unique position of being the only vendor that manufactures nearly all of the components throughout the entire storage technology stack. This allows us to offer some of the most aggressive pricing in the industry and lead the charge to help businesses gain greater insights and transform data and themselves.

Mounting Object Stores in HDFS

Join us next week at Hortonworks’ DataWorks Summit in Berlin. A Western Digital storage architect and software engineer will present how Hadoop admins can manage storage tiering between HDFS and external namespaces, both object stores and other HDFS clusters, and how that tiering is then handled by HDFS snapshots and storage policies.

Learn more about our session here.
