Why Object Storage? A Short, Definitive Explanation

If you do an internet search on object storage, you’ll find plenty of explanations on what object storage is, but the one thing no one tells you is why you would want to use it in the first place.  If you’re a systems engineer, that’s not particularly helpful.  When I first heard about object storage, I spent hours researching it on the internet, trying to figure out what problems it was meant to solve.  Yes, it stores files as objects instead of file system blocks, but so what?  If you expect me to abandon my existing storage solution for an object store, you’re going to have to give me a pretty compelling reason.  So what is it? If you study the problem for long enough, you’ll eventually be able to piece it together, but I think I can save you some time.


Why Object Storage?

The main problem object storage is trying to solve is scalability.  We are experiencing an explosive growth in data right now.  In 2017, over half of the IT organizations we surveyed said that they have storage capacities of 50PB or more[1].  The rate at which we generate data is constantly increasing, which means that this problem is only going to get worse as time goes on.  If you’re one of the engineers who’s responsible for storing and protecting all of that data, this is a big problem because conventional storage techniques just weren’t designed to handle this kind of scale.  Let me explain…

Traditional Storage Approaches

The traditional approach is to use SAN and NAS systems for storage. This works well for smaller data sets, but it falls apart when you attempt to scale it. The problem is the file system. Traditional block-based file systems use lookup tables to store file locations. They break each file up into small blocks, generally 4 KB in size, and store the byte offset of each block in a large table. This is fine for small volumes, but when you attempt to scale to the petabyte range, these lookup tables become extremely large. It’s like a database: the more rows you insert, the slower your queries run. Eventually performance degrades to the point where the file system becomes unusable. When this happens, users are forced to split their data sets across multiple LUNs to maintain an acceptable level of performance, which adds complexity and makes these systems difficult to manage.
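To get a feel for the scale involved, here is a back-of-the-envelope sketch. The 4 KB block size matches the figure above; the 8 bytes of lookup metadata per block is an illustrative assumption, not a measurement of any particular file system:

```python
# Back-of-the-envelope: size of a block lookup table at various scales.
# Assumptions (illustrative only): 4 KB blocks, 8 bytes of metadata per block.

BLOCK_SIZE = 4 * 1024   # bytes per block
ENTRY_SIZE = 8          # assumed bytes of lookup metadata per block

for label, capacity in [("1 TB", 1e12), ("1 PB", 1e15), ("100 PB", 1e17)]:
    blocks = capacity / BLOCK_SIZE
    table_gb = blocks * ENTRY_SIZE / 1e9
    print(f"{label:>7}: {blocks:,.0f} blocks, ~{table_gb:,.0f} GB of entries")

# 1 PB at 4 KB blocks is ~244 billion entries -- roughly 2 TB of lookup
# metadata that has to be searched on every file access.
```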

What About Scale-Out File Systems?

To solve this problem, some organizations are deploying scale-out file systems, like HDFS. This fixes the scalability problem, but keeping these systems up and running is a labor-intensive process. Scale-out file systems are complex and require constant maintenance. In addition, most of them rely on replication to protect your data. The standard configuration is triple replication, where you store 3 copies of every file. This requires an extra 200% of raw disk capacity for overhead! Everyone thinks they’re saving money by using commodity drives, but by the time you store three full copies of your data set, the cost savings disappear. For petabyte-scale applications, this is an expensive approach.
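As a quick sanity check on that overhead figure (a minimal sketch, using the 3x replication factor described above):

```python
# Raw disk needed to hold 1 PB of data under triple replication.
usable_pb = 1.0
replicas = 3                          # standard triple-replication
raw_pb = usable_pb * replicas
overhead = (raw_pb - usable_pb) / usable_pb
print(f"{raw_pb:.0f} PB raw for {usable_pb:.0f} PB usable "
      f"({overhead:.0%} extra capacity)")   # -> 3 PB raw, 200% extra
```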

The Public Cloud

To minimize the complexity, many organizations are using public cloud storage. The pay-as-you-go model offered by hosting providers makes sense for smaller data sets, but it can be extremely expensive at scale. Providers don’t just charge you for storage capacity: they charge per-gigabyte egress rates for network usage and per-transaction fees to access your files. When you’re dealing with big data, petabytes of storage, and massive file transfers, the public cloud is often the most costly approach. Then there are data privacy and performance concerns. Because you’re sharing system resources with other public cloud users, performance can be wildly inconsistent, and you may have little control over where your data sits.
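To see how quickly those per-gigabyte fees add up, here is a rough sketch; the $0.09/GB egress rate is a hypothetical placeholder, not any provider’s actual price list:

```python
# Illustrative cost of pulling 1 PB back out of a public cloud.
# The rate below is a hypothetical placeholder -- check your provider's pricing.

EGRESS_PER_GB = 0.09        # assumed $/GB network egress
dataset_gb = 1_000_000      # 1 PB expressed in GB
egress_cost = dataset_gb * EGRESS_PER_GB
print(f"Retrieving 1 PB: ~${egress_cost:,.0f} in egress fees alone")
# -> ~$90,000 per full retrieval, before storage or request charges.
```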

Object Storage On-Premises to the Rescue

This is where object storage on-premises comes in. Object storage is the same technology that powers the public cloud, giving service providers an extremely cost-effective and highly scalable environment. Object stores can scale to hundreds of petabytes in a single namespace without suffering any sort of performance degradation. They minimize costs by using techniques like erasure coding to maximize usable disk space. Beyond that, object storage offers data management functionality that traditional storage approaches can’t begin to touch, like versioning, customizable metadata, and embedded analytics.

Object stores achieve their scalability by decoupling file management from low-level block management. Each disk is formatted with a standard local file system, like ext4. Then a set of object storage services is layered on top, combining everything into a single, unified volume. Files are stored as “objects” in the object store rather than as files on a file system. By offloading the low-level block management onto the local file systems, the object store only has to keep track of the high-level details. This layer of separation keeps the file lookup tables at a manageable size, allowing you to scale to hundreds of petabytes without experiencing degraded performance.
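As a toy illustration of that layering (a minimal sketch, assuming a POSIX file system such as ext4 underneath; a real object store adds distributed metadata, replication or erasure coding, and an S3-style API on top):

```python
import hashlib
import os

class ToyObjectStore:
    """Minimal key/value object store layered on a local file system.

    The local file system (e.g. ext4) handles all low-level block
    management; this layer only maps object keys to files, fanned out
    into hashed subdirectories so no single directory grows unbounded.
    """

    def __init__(self, root="/tmp/toystore"):
        self.root = root

    def _path(self, key: str) -> str:
        digest = hashlib.sha256(key.encode()).hexdigest()
        # Two levels of fan-out: /root/ab/cd/abcdef...
        return os.path.join(self.root, digest[:2], digest[2:4], digest)

    def put(self, key: str, data: bytes) -> None:
        path = self._path(key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def get(self, key: str) -> bytes:
        with open(self._path(key), "rb") as f:
            return f.read()

store = ToyObjectStore()
store.put("reports/2017/q3.pdf", b"...object bytes...")
print(store.get("reports/2017/q3.pdf"))
```

Note how the local file system still does all the block-level bookkeeping; the object layer only tracks keys, which is what keeps its own metadata small at scale.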

To maximize usable space, object stores use a technique called erasure coding to protect your data. If you’re unfamiliar with erasure coding, you can think of it as the next generation of RAID. In an erasure-coded volume, files are divided into shards, with each shard placed on a different disk. Additional shards are added containing error-correction information, which protect against data corruption and disk failures. Only a subset of the shards is required to retrieve each file, which means the volume can survive multiple disk failures without risk of data loss. Erasure-coded volumes can survive more disk failures than RAID and typically provide more than double the usable capacity of triple replication, making them the ideal choice for petabyte-scale storage.
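To make the shard idea concrete, here is a deliberately simplified single-parity sketch (the degenerate case with one XOR parity shard; production systems use Reed-Solomon-style codes with several parity shards, and so can tolerate several simultaneous failures):

```python
from functools import reduce

def encode(data: bytes, k: int):
    """Split data into k data shards plus one XOR parity shard."""
    shard_len = -(-len(data) // k)                  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i*shard_len:(i+1)*shard_len] for i in range(k)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*shards))
    return shards + [parity]

def rebuild(shards, missing: int) -> bytes:
    """Recover one lost shard by XOR-ing all the surviving shards."""
    survivors = [s for i, s in enumerate(shards) if i != missing]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))

shards = encode(b"petabyte-scale data", k=4)    # 4 data shards + 1 parity
assert rebuild(shards, missing=2) == shards[2]  # survives one failed disk
# Overhead here: 1 extra shard on 4, i.e. 25% -- versus 200% for 3x replication.
```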

Managing Unstructured Data

Whenever you’re working with bulk storage, the majority of your data is probably going to be unstructured. “Unstructured data” is a buzzword you hear a lot regarding object storage.  It doesn’t mean that the files themselves are unstructured.  Certainly, every file has some type of structure.  When somebody says that their data is unstructured, they mean that it’s not stored in a database.  It’s just a loose collection of files, created by a variety of different applications.  A good example is your documents folder.

Object stores help you manage unstructured data by letting you tag files with custom metadata that describe their contents.  This allows you to track and index files without the need for external software or databases.  All of your data is self-describing, which opens up an array of new possibilities for data analytics.  You can index and search your files without any prior knowledge of a file’s internal structure or what applications were used to create it.  By giving your files context, you can perform analytics directly on your data.
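For example, any S3-compatible store lets you attach custom metadata at upload time and read it back later without downloading the object itself. A sketch using the boto3 client; the endpoint, bucket, key, and tag names are placeholders:

```python
import boto3

# Placeholder endpoint and bucket -- point these at your S3-compatible store.
s3 = boto3.client("s3", endpoint_url="https://objectstore.example.com")

# Attach custom, self-describing metadata at upload time.
s3.put_object(
    Bucket="research-data",
    Key="scans/patient-042.dcm",
    Body=b"...raw scan bytes...",
    Metadata={"modality": "mri", "study": "cardiac", "year": "2017"},
)

# Later, read the tags back without transferring the object's contents.
head = s3.head_object(Bucket="research-data", Key="scans/patient-042.dcm")
print(head["Metadata"])   # {'modality': 'mri', 'study': 'cardiac', 'year': '2017'}
```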

The ActiveScale™ Object Storage System

The ActiveScale system from Western Digital is a full-featured, S3-compatible object storage system, designed from the ground up to handle today’s big data storage challenges. It is built for petabyte-scale deployments and uses advanced data storage techniques, like erasure coding, to protect your data and maximize usable disk space.

From the components that make our hard drives reliable, to the expansion of capacity through our leading helium-sealed technology and all the way up the system stack, Western Digital innovates to deliver some of the densest storage systems on the planet. We are in the unique position of being the only vendor that manufactures nearly all of the components throughout the entire storage technology stack.  This enables us to offer some of the most aggressive pricing in the industry and lead the charge to help businesses transform themselves.

If you’d like to learn more about the ActiveScale system, visit the Western Digital website.

Object Storage Resources

Now that you know why object storage matters, learn more about how it can be used to solve your data storage problems. Here are some recommended reads:

[1] The Growing Role of Object Storage in Solving Unstructured Data Challenges, 451 Research survey, in cooperation with HGST, 2017.

Author

Mike McWhorter: Mike is a Senior Technologist for Western Digital and specializes in performance tuning and storage optimization for big data customers.

Related Posts

Big Data Solutions
Published on April 9, 2020

Learn about storage architectures that are solutions for the challenges of data at scale, including Ultrastar hard drives and OpenFlex [...]

Examples of Unstructured Data — 4 Enterprise Use Cases
Published on September 16, 2019

Examples of unstructured data include large media files, massive databases, data lakes, and architectural information as well as billions [...]


5 comments on “Why Object Storage? A Short, Definitive Explanation”

  1. Jake

    I’m new to studying data services, but isn’t S3 public cloud? Why would S3-compatible object storage be necessary if you are saying that public cloud is expensive? Is it just for backup?

  2. Mike McWhorter

    Hi Jake. Great question! When people talk about S3, they’re often referring to Amazon S3, which is a public cloud-based storage service on AWS. When Amazon created the S3 service back in 2006, they did a really good job of designing the API, and when they published the spec, it was adopted as the de facto standard interface for object storage. So S3 is also used to refer to the S3 API. Today, there are many alternatives to Amazon S3, both on-prem and in the cloud, and nearly all of them support the S3 API. I hope that clears things up. Thanks for reading the blog!

  3. Jools

    Thank you for the informative blog! However, you use the concepts ecosystem and supply chain rather interchangeably, except under point six where you say “Data plays a key role in this ecosystem and its supply chain”. I was wondering what in your opinion is the difference between an ecosystem and a supply chain in this context? All answers are welcome!

  4. Brielle Perez

    Thank you so much! Most file systems have restrictions on the number of files, directories and levels of hierarchy that can be supported — which limits the amount of data that can be stored.

  5. Dee

    “Files are stored as ‘objects’ in the object store rather than as files on a file system” – Can you please explain how this translates to actual storage? How is storage different for an object vs. a file? Ultimately both have to be stored in sectors on a disk. What does this layer of abstraction do that a file system can’t?
