Within our SanDisk® labs, I conducted a number of experiments with Apache Hadoop and SanDisk flash, mainly our CloudSpeed Ascend SATA Solid State Drives (SSDs). The initial experiments used two standard Hadoop benchmarks, Terasort and TestDFSIO. These benchmarks showed me how SanDisk SSDs boost the performance of Terasort and TestDFSIO jobs. By focusing on the cost per job, I was also able to extrapolate those performance benefits to show a lower cost of ownership for SSDs in a Hadoop environment. These benchmarks were great starting points for studying flash within the Hadoop ecosystem.
Continued research into Hadoop benchmarks that more closely reflect real-world workloads then brought me to the SWIM (Statistical Workload Injector for MapReduce) benchmark. SWIM provides a repository of real-life MapReduce workloads from production systems, along with tools to generate representative workloads that operate on these real-life datasets. The benchmark allows rigorous performance and stability testing of MapReduce systems.
An abbreviated version of the SWIM benchmark also forms the basis of the Cloudera Hardware certification test suite, which provides a strong statement of partnership and confirmed interoperability of technologies to the thousands of customers around the world running Cloudera.
Cloudera Hardware Certification
To continue my testing with real-life datasets and workloads, and to nurture the strong partnership between Cloudera and SanDisk, I ran the Cloudera Hardware certification test suite on the SanDisk labs cluster.
If you are not familiar with the program, “the Cloudera Certified Technology program was created to make it simpler for Apache Hadoop technology buyers to purchase the right components and software applications to extract the most value from their data.
Building a Hadoop cluster from the ground up can be challenging. There are numerous choices to be made at all levels of the stack and making those choices can be complex. The Cloudera Certified Technology program is designed to make choosing the right technology easier.” (You can learn more about Cloudera Certified Technology on the Cloudera website.)
The lab cluster ran the Cloudera® Distribution of Hadoop (CDH), version 5.1, and had one NameNode and eight DataNodes, all populated with SanDisk CloudSpeed Ascend™ SATA SSDs. The cluster was set up as per the test suite requirements and recommendations. The certification test suite completed without any failures. The results, along with the necessary diagnostics, were submitted to Cloudera to obtain official certification for the CloudSpeed Ascend SATA SSDs.
With this testing complete and confirmed with Cloudera, I am proud to announce that our CloudSpeed Ascend SATA SSDs are a Cloudera Certified Technology Product!
For customers, Cloudera Certified Technologies such as SanDisk SSDs operate with lower risk and lower total cost of ownership (TCO), and they comply with Cloudera development guidelines for integration with Hadoop, ensuring better, trusted value.
Madhura has numerous years of experience in software design and development, and technical and solutions marketing. At SanDisk, she currently works in Systems & Software Solutions as a Product Manager. Prior to that, she was the Technical Marketing Manager for Hadoop and Big-Data Solutions, focused on developing technical white papers, best practice guides, reference architectures and other technical collateral for Hadoop and Big-Data Solutions using SanDisk enterprise flash products. Before joining SanDisk, she worked on software design and development for high-availability clusters, file system performance analysis and engineering, and technical and solution marketing for storage products. She has worked at Sun Microsystems (now Oracle Corporation), NetApp, BlueArc Corporation and Hitachi Data Systems. Madhura earned a Bachelor’s degree in Computer Engineering from the University of Pune, India and a Master’s degree in Computer Science from State University of New York, Binghamton.