Apache Hadoop is gaining traction across a variety of industries because it can manage and process the large volumes of unstructured data (Big Data) that inundate today’s enterprises. Extensive data analytics are being implemented on Apache Hadoop environments to give executives more accurate and relevant information, helping them make better business decisions, faster.
Apache Hadoop is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models. It is designed for very large datasets and can scale from a few servers to several thousand, with each server offering local computation and storage. Its basic components include the Hadoop Distributed File System (HDFS) and Hadoop MapReduce.
How Flash Impacts Hadoop Performance
Typically, Apache Hadoop environments use commodity servers, with SATA HDDs as local storage on cluster nodes.
However, SSDs used strategically within an Apache Hadoop environment can provide a significant performance benefit, which translates directly into faster job completion and, therefore, better utilization of the Hadoop cluster. That higher utilization also helps reduce the total cost of ownership (TCO) over the lifetime of the cluster, helping organizations better contend with the onslaught of data being generated.
To help quantify this, our team benchmarked the extent to which SSDs can impact Hadoop deployments from both a performance and a cost perspective. We conducted a 1TB Terasort benchmark test on a Hadoop cluster using SanDisk® CloudSpeed SSDs.
SanDisk offers a broad portfolio of SSD solutions, each tailored for specific workload needs. The CloudSpeed™ SATA SSD product family was designed to address the growing need for SSDs that are optimized for mixed-workload applications in enterprise server and cloud computing environments. Specifically, compared with traditional spinning hard disk drives (HDDs), the CloudSpeed 1000 drives can provide a significant performance improvement when running I/O-intensive workloads, especially mixed workloads with random read and write data accesses. They therefore seemed the best fit for Hadoop workload needs and for our testing.
Flash-Enabled Hadoop Cluster Testing
Our team tested a Hadoop cluster using the Cloudera® Distribution of Hadoop (CDH) on Red Hat® Enterprise Linux. The cluster consisted of one NameNode and six DataNodes and was set up to determine the benefits of using SSDs within a Hadoop environment, focusing on the Terasort benchmark.
We ran the standard Hadoop Terasort benchmark on three different cluster configurations:
- All-HDD configuration: The Hadoop DataNodes used HDDs for both the Hadoop Distributed File System (HDFS) and Hadoop MapReduce.
- HDD with SSD for intermediate data: The Hadoop DataNodes used HDDs as in the first configuration, plus a single SanDisk SSD that was configured in MapReduce to hold the intermediate shuffle/sort data.
- All-SSD configuration: The HDDs of the first configuration were replaced with SanDisk SSDs.
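To make the difference between the three layouts concrete, here is a minimal Python sketch that emits the kind of Hadoop property entries that would separate HDFS storage from intermediate MapReduce data. It is an illustration under assumptions, not the configuration used in our tests: the mount points are hypothetical, and the property names (`dfs.datanode.data.dir` and `mapreduce.cluster.local.dir`) are the Hadoop 2.x names. Older MRv1 releases use `dfs.data.dir` and `mapred.local.dir`, and on YARN clusters the NodeManager's `yarn.nodemanager.local-dirs` setting governs where intermediate data lands.

```python
import xml.etree.ElementTree as ET

# Hypothetical mount points, chosen only for illustration.
HDD_MOUNTS = ["/data/hdd1", "/data/hdd2"]
SSD_MOUNTS = ["/data/ssd1", "/data/ssd2"]


def layout(hdfs_mounts, intermediate_mounts):
    """Map Hadoop 2.x property names to comma-separated directory lists.

    Assumption: property names follow Hadoop 2.x conventions; check the
    defaults shipped with your own distribution before using them.
    """
    return {
        "dfs.datanode.data.dir": ",".join(f"{m}/dfs/dn" for m in hdfs_mounts),
        "mapreduce.cluster.local.dir": ",".join(
            f"{m}/mapred/local" for m in intermediate_mounts
        ),
    }


def property_xml(name, value):
    """Render one <property> element as it appears in a *-site.xml file."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# The three layouts described in the list above.
CONFIGS = {
    "all-hdd": layout(HDD_MOUNTS, HDD_MOUNTS),
    "hdd-plus-ssd-intermediate": layout(HDD_MOUNTS, SSD_MOUNTS[:1]),
    "all-ssd": layout(SSD_MOUNTS, SSD_MOUNTS),
}

if __name__ == "__main__":
    for label, props in CONFIGS.items():
        print(f"# {label}")
        for name, value in props.items():
            print(property_xml(name, value))
```

Only the hybrid layout points the two properties at different device types, which is what isolates the effect of moving shuffle/sort traffic onto flash.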
For each of the above configurations, Teragen and Terasort were executed for a 1TB dataset, and the time taken to run Teragen and Terasort was recorded for analysis.
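For readers who want to reproduce a run like this, the sketch below shows one way to build and time the two benchmark commands from Python. Several details are assumptions rather than facts from our test setup: the examples jar path and the HDFS directories are placeholders, and 1TB is taken to mean 10^12 bytes. Teragen takes its size argument as a number of 100-byte rows, so a 1TB dataset corresponds to 10 billion rows.

```python
import subprocess
import time

ROW_BYTES = 100  # Teragen writes fixed-size 100-byte rows


def teragen_rows(dataset_bytes):
    """Number of Teragen rows needed to produce the requested dataset size."""
    return dataset_bytes // ROW_BYTES


def benchmark_commands(examples_jar, dataset_bytes, input_dir, output_dir):
    """Build the Teragen and Terasort command lines as argument lists."""
    rows = teragen_rows(dataset_bytes)
    return [
        ["hadoop", "jar", examples_jar, "teragen", str(rows), input_dir],
        ["hadoop", "jar", examples_jar, "terasort", input_dir, output_dir],
    ]


def run_timed(command):
    """Run one command and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(command, check=True)
    return time.monotonic() - start


if __name__ == "__main__":
    # Placeholder jar path and HDFS directories; adjust for your cluster.
    commands = benchmark_commands(
        examples_jar="/path/to/hadoop-mapreduce-examples.jar",
        dataset_bytes=10**12,  # assumption: 1TB = 10^12 bytes
        input_dir="/benchmarks/terasort-input",
        output_dir="/benchmarks/terasort-output",
    )
    for command in commands:
        print(" ".join(command))
        # Uncomment on a machine with a configured Hadoop client:
        # print(f"took {run_timed(command):.0f} s")
```

Summing the two durations gives a total runtime comparable to the figures reported in the next section.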
Results – Runtime Performance
The total runtime results for the Terasort benchmark (including the dataset generation phase using Teragen) for the three configurations of our testing are shown in the graphs below, quantifying the improvements seen with the SSD configurations.
Observations from these results:
- Introducing SSDs for the intermediate shuffle/sort phase within MapReduce can reduce the total runtime of a 1TB Terasort benchmark run by ~32%, completing the job 1.4x faster than on an all-HDD configuration.
- Replacing all the HDDs on the DataNodes with SSDs can reduce the 1TB Terasort benchmark runtime by ~62%, therefore completing the job 2.6x faster than on an all-HDD configuration.
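The link between a runtime reduction and an "x faster" figure is a one-line calculation: if runtime falls by a fraction r, the job completes 1 / (1 - r) times faster. The short sketch below applies that to the rounded percentages quoted above. The function name is illustrative, and because the inputs are approximate, the outputs (roughly 1.47 and 2.63) only approximately match the quoted speedups.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional runtime reduction into a speedup factor.

    A reduction of 0.5 (runtime halved) corresponds to a 2.0x speedup.
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # Rounded reductions quoted in the observations above.
    for label, reduction in [("SSD for intermediate data", 0.32), ("all-SSD", 0.62)]:
        print(f"{label}: {speedup_from_reduction(reduction):.2f}x")
```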
Cost per Job Analysis
The runtime results for a single Terasort job were used to determine the total number of jobs that can be completed on the Hadoop cluster in a single day. This was then extended to calculate the number of jobs that can be completed over a longer period, specifically 3 years. The total cost of the Hadoop cluster was then divided by the total number of jobs over 3 years to determine the cost-per-job metric for the cluster. This was done for each of the three configurations tested.
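The calculation described above can be sketched in a few lines of Python. The runtime and cluster cost in the example are hypothetical placeholders, not our measured or priced values, and the sketch assumes jobs run back to back around the clock for 365 days a year.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: leap days ignored


def jobs_per_day(job_runtime_seconds):
    """Jobs completed in one day when run back to back with no idle time."""
    return SECONDS_PER_DAY / job_runtime_seconds


def cost_per_job(job_runtime_seconds, cluster_cost, years=3):
    """Total cluster cost divided by the jobs completed over the period."""
    total_jobs = jobs_per_day(job_runtime_seconds) * DAYS_PER_YEAR * years
    return cluster_cost / total_jobs


if __name__ == "__main__":
    # Hypothetical inputs for illustration only: a one-hour job on a
    # cluster costing 262,800 currency units works out to 10.00 per job.
    print(f"{cost_per_job(3600, 262_800):.2f}")
```

Because cost per job depends on both cluster price and runtime, a configuration that costs more up front can still come out cheaper per job if it shortens the runtime enough.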
These calculations help us compare the cost of the three Hadoop cluster configurations, as can be seen in the chart below:
Observations from these results:
- Adding a single SSD to an HDD configuration and using it for intermediate data reduces the cost per job by 22%, compared to the all-HDD configuration.
- The all-SSD configuration reduces the cost per job by 53% when compared to the all-HDD configuration.
- Both the preceding observations indicate that, over the lifetime of the infrastructure, the total cost of ownership (TCO) is significantly lower for the SSD Hadoop configurations.
In summary, our tests showed that SSDs benefit Hadoop environments that are storage-intensive, particularly those that see a very high proportion of random accesses to storage.
SSDs can be deployed strategically in Hadoop environments to provide significant performance improvements and TCO benefits, particularly because SanDisk SSDs plug directly into standard SAS, SATA, and PCIe interfaces. The performance gains shorten the time-to-results for the workloads that benefit most from using flash SSDs in place of HDDs.
You can learn more about the Apache Hadoop Terasort benchmark test we performed with SanDisk SSDs and our results by downloading the detailed technical paper ‘Increasing Hadoop performance with Solid State Drives’ from the SanDisk.com website. You can reach out to me at [email protected] if you have any questions, and join the conversation with us on flash and big data on Twitter at @SanDiskDataCtr.