Hadoop just turned 10. During my career, I was fortunate to work at Yahoo, where Hadoop was born and grew from a technical paper into the first operational enterprise data platform. I built the very first research engineering team from the ground up to stand up the first fully operational Hadoop grid clusters, handling loads on the order of petabytes of data. I was also part of the team that worked on open-sourcing Apache Hadoop.
Over the years, Hadoop has gained tremendous momentum, giving birth to many distributions with wide adoption across enterprises. It has become an integral part of the de facto Big Data platform stack. It’s robust, reliable, scalable, and enterprise-grade. However, as progress marches on, some components of the traditional architecture are showing their age, and a new approach to Hadoop architecture is emerging.
Shared vs. Distributed HDFS
HDFS has served as Hadoop’s primary storage system. It is a distributed file system that provides high-performance access to data across Hadoop clusters, and it has become the file system of choice for many enterprises that manage large pools of Big Data and run Big Data analytics applications.
But the nature of progress is that technology is in continuous evolution. More compelling systems emerge with better architectures and better storage.
So what’s the right Hadoop architecture for your Big Data analytics – shared or distributed?
What’s Right for Me?
In a recent webinar, I compared and contrasted the two current approaches. The original HDFS approach utilizes storage co-located with the compute servers. An emerging alternative relies on dedicated storage resources shared by the compute cluster.
My goal was to give planners and architects definitive guidelines that help them identify the best solution for their needs when implementing Hadoop.
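To make the contrast between the two approaches concrete, here is a minimal sizing sketch in Python. It is an illustration only and is not taken from the webinar: the `Workload` and `NodeSpec` types, the function names, and all of the numbers are hypothetical, and it deliberately ignores replication, networking, and cost. It shows only the structural difference: with co-located storage, every node supplies both compute and capacity, so the larger of the two requirements sets the node count, whereas with shared storage the two tiers are sized separately.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical requirements for an analytics workload."""
    cores_needed: int          # total CPU cores the jobs require
    storage_tb_needed: float   # total capacity required, in TB


@dataclass
class NodeSpec:
    """Hypothetical server that provides both compute and local disks."""
    cores: int
    storage_tb: float


def colocated_nodes(w: Workload, node: NodeSpec) -> int:
    """Co-located model: each node adds compute AND storage together,
    so the larger of the two demands dictates the node count."""
    by_compute = math.ceil(w.cores_needed / node.cores)
    by_storage = math.ceil(w.storage_tb_needed / node.storage_tb)
    return max(by_compute, by_storage)


def shared_nodes(w: Workload, cores_per_compute_node: int,
                 tb_per_storage_node: float) -> tuple[int, int]:
    """Shared model: compute and storage tiers are sized independently.
    Returns (compute_node_count, storage_node_count)."""
    compute = math.ceil(w.cores_needed / cores_per_compute_node)
    storage = math.ceil(w.storage_tb_needed / tb_per_storage_node)
    return compute, storage


# Hypothetical example: a storage-heavy workload.
example = Workload(cores_needed=400, storage_tb_needed=2000.0)
colocated = colocated_nodes(example, NodeSpec(cores=16, storage_tb=48.0))
shared = shared_nodes(example, cores_per_compute_node=16,
                      tb_per_storage_node=48.0)
```

With these made-up inputs, the co-located cluster needs 42 nodes because capacity, not compute, is the binding requirement, while the shared layout needs 25 compute nodes alongside a separately sized storage tier. Which layout is actually better depends on factors this sketch leaves out.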
You can stream the webinar, on-demand, for free. Feel free to reach out in the comments below with your questions.
WEBINAR: Shared or Distributed HDFS – What’s Right for Me? Stream it here
Janet is a Fellow and Chief Data Scientist at WDC with 15+ years of experience in Big Data platforms, machine learning, distributed computing, compilers, and AI.