If you do an internet search on object storage, you’ll find plenty of explanations on what object storage is, but the one thing no one tells you is why you would want to use it in the first place. If you’re a systems engineer, that’s not particularly helpful. When I first heard about object storage, I spent hours researching it on the internet, trying to figure out what problems it was meant to solve. Yes, it stores files as objects instead of file system blocks, but so what? If you expect me to abandon my existing storage solution for an object store, you’re going to have to give me a pretty compelling reason. So what is it? If you study the problem for long enough, you’ll eventually be able to piece it together, but I think I can save you some time.
The main problem object storage is trying to solve is scalability. We are experiencing an explosive growth in data right now. In 2017, over half of the IT organizations we surveyed said that they have storage capacities of 50 PB or more. The rate at which we generate data is constantly increasing, which means that this problem is only going to get worse as time goes on. If you’re one of the engineers who’s responsible for storing and protecting all of that data, this is a big problem because conventional storage techniques just weren’t designed to handle this kind of scale. Let me explain…
Traditional Storage Approaches
The traditional approach is to use SAN and NAS systems for storage. This works well for smaller data sets, but it falls apart when you attempt to scale it. The problem is the file system. Traditional block-based file systems use lookup tables to store file locations. They break each file up into small blocks, generally 4 KB in size, and store the byte offset of each block in a large table. This is fine for small volumes, but when you attempt to scale to the petabyte range, these lookup tables become extremely large. It’s like a database: the more rows you insert, the slower your queries run. Eventually your performance degrades to the point where your file system becomes unusable. When this happens, users are forced to split their data sets up into multiple LUNs to maintain an acceptable level of performance. This adds complexity and makes these systems difficult to manage.
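To get a feel for how quickly those tables grow, here is a rough back-of-the-envelope sketch. The 8-byte entry size is my own assumption for illustration, and real file systems use more elaborate structures than one flat table, so treat the numbers as an order-of-magnitude estimate only.

```python
# Rough size of a block lookup table as a volume grows.
# Assumptions (not from any specific file system): 4 KiB blocks and
# 8 bytes per table entry.
BLOCK_SIZE = 4 * 1024   # bytes per block
ENTRY_SIZE = 8          # bytes per table entry (hypothetical)


def block_table_stats(volume_bytes):
    """Return (number of blocks, table size in bytes) for a full volume."""
    blocks = volume_bytes // BLOCK_SIZE
    return blocks, blocks * ENTRY_SIZE


for label, size in [("1 TB", 10**12), ("1 PB", 10**15), ("50 PB", 50 * 10**15)]:
    blocks, table = block_table_stats(size)
    print(f"{label}: {blocks:,} blocks, ~{table / 10**9:,.1f} GB of table entries")
```

Under these assumptions the table grows linearly with the volume, so a thousand-fold jump in capacity means a thousand-fold jump in entries to search and keep consistent.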
What About Scale-Out File Systems?
To solve this problem, some organizations are deploying scale-out file systems, like HDFS. This fixes the scalability problem, but keeping these systems up and running is a labor-intensive process. Scale-out file systems are complex and require constant maintenance. In addition, most of them rely on replication to protect your data. The standard configuration is triple replication, where you store three copies of every file. This requires an extra 200% of raw disk capacity for overhead! Everyone thinks that they’re saving money by using commodity drives, but by the time you store three full copies of your data set, the cost savings disappear. When we’re talking about petabyte-scale applications, this is an expensive approach.
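The replication math is simple enough to check yourself. This is a minimal sketch; the function names and the 10 PB data set are just examples I picked.

```python
def raw_capacity_needed(usable_pb, copies=3):
    """Raw capacity (in PB) needed to hold usable_pb with full-copy replication."""
    return usable_pb * copies


def overhead_percent(copies=3):
    """Extra raw capacity as a percentage of the data set size."""
    return (copies - 1) * 100


usable = 10  # PB of actual data, an example figure
print(f"{usable} PB of data needs {raw_capacity_needed(usable)} PB of raw disk "
      f"({overhead_percent()}% overhead)")
```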
The Public Cloud
To minimize the complexity, many organizations are using public cloud storage. The pay-as-you-go model offered by hosting providers makes sense for smaller data sets, but can be extremely expensive when you attempt to deploy it at scale. Providers don’t just charge you for storage capacity. They also charge per-gigabyte rates for network usage, including egress fees when you retrieve data, and per-transaction fees to access your files. When you’re dealing with big data, petabytes of storage, and massive file transfers, the public cloud is often the most costly approach. Then there are data privacy and performance concerns. Because you’re sharing system resources with other public cloud users, performance can be wildly inconsistent, and you may have little control over where your data sits.
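If you want to see how the three charge types add up for your own workload, a small cost model helps. This is only a sketch: every rate below is a made-up placeholder, not any provider’s actual pricing, so plug in the numbers from your own bill or rate card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudRates:
    # All rates are hypothetical placeholders, not real provider prices.
    storage_per_gb_month: float = 0.02
    egress_per_gb: float = 0.05
    per_1000_requests: float = 0.005


def monthly_cost(stored_gb, egress_gb, requests, rates=CloudRates()):
    """Break one month's bill into storage, egress, and transaction charges."""
    storage = stored_gb * rates.storage_per_gb_month
    egress = egress_gb * rates.egress_per_gb
    transactions = requests / 1000 * rates.per_1000_requests
    return {
        "storage": storage,
        "egress": egress,
        "transactions": transactions,
        "total": storage + egress + transactions,
    }


# Example: 1 PB stored, 100 TB read back out, 10 million requests.
bill = monthly_cost(stored_gb=1_000_000, egress_gb=100_000, requests=10_000_000)
for item, dollars in bill.items():
    print(f"{item}: ${dollars:,.2f}")
```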
Object Storage On-Premises to the Rescue
This is where object storage on-premises comes in. Object storage is the same technology that enables the public cloud, and it gives service providers an extremely cost-effective and highly scalable environment. Object stores can scale to hundreds of petabytes in a single namespace without suffering any sort of performance degradation. They minimize costs by using techniques like erasure coding to maximize usable disk space. Beyond that, object storage offers data management functionality that traditional storage approaches can’t begin to touch, like versioning, customizable metadata, and embedded analytics.
Object stores achieve their scalability by decoupling file management from the low-level block management. Each disk is formatted with a standard local file system, like ext4. Then a set of object storage services is layered on top of it, combining everything into a single, unified volume. Files are stored as “objects” in the object store rather than as files on a file system. By offloading the low-level block management onto the local file systems, the object store only has to keep track of the high-level details. This layer of separation keeps the file lookup tables at a manageable size, allowing you to scale to hundreds of petabytes without experiencing degraded performance.
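The layering described above can be sketched in a few lines. This is a toy, not how any particular product works: the class name is hypothetical, ordinary directories stand in for disks, and hashing the key to pick a disk is my own simplifying assumption. The point is that the object layer only decides which disk holds an object, while the local file system underneath handles the blocks.

```python
import hashlib
import tempfile
from pathlib import Path


class ToyObjectStore:
    """Maps object keys to files spread across several 'disk' directories."""

    def __init__(self, disk_paths):
        self.disks = [Path(p) for p in disk_paths]
        for disk in self.disks:
            disk.mkdir(parents=True, exist_ok=True)

    def _locate(self, key):
        # Hash the key to choose a disk; the digest doubles as the file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        disk = self.disks[int(digest, 16) % len(self.disks)]
        return disk / digest

    def put(self, key, data):
        # The local file system takes care of block allocation.
        self._locate(key).write_bytes(data)

    def get(self, key):
        return self._locate(key).read_bytes()


with tempfile.TemporaryDirectory() as root:
    store = ToyObjectStore([Path(root) / f"disk{i}" for i in range(4)])
    store.put("reports/2017/q1.pdf", b"example bytes")
    print(store.get("reports/2017/q1.pdf"))
```

Notice that the store keeps no per-block table at all. Given a key, it can compute where the object lives.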
To maximize usable space, object stores use a technique called erasure coding to protect your data. If you’re unfamiliar with erasure coding, you can think of it as the next generation of RAID. In an erasure coded volume, files are divided into shards, with each shard being placed on a different disk. Additional shards are added, containing error correction information, which provide protection from data corruption and disk failures. Only a subset of the shards is required to retrieve each file, which means it can survive multiple disk failures without the risk of data loss. Erasure coded volumes can survive more disk failures than RAID and typically provide more than double the usable capacity of triple replication, making them the ideal choice for petabyte-scale storage.
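Here is a deliberately simplified sketch of the idea. It uses a single XOR parity shard, which can only rebuild one lost shard; production systems use stronger codes, such as Reed-Solomon, that add several parity shards and tolerate several failures. The function names and the 10+4 layout in the capacity comparison are my own illustrative choices.

```python
def split_into_shards(data, k):
    """Split data into k equal-length shards, zero-padding the tail."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    return [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]


def xor_shards(shards):
    """XOR equal-length byte strings together, byte by byte."""
    out = bytearray(len(shards[0]))
    for shard in shards:
        for i, byte in enumerate(shard):
            out[i] ^= byte
    return bytes(out)


def usable_fraction(data_shards, parity_shards):
    """Share of raw capacity that holds actual data."""
    return data_shards / (data_shards + parity_shards)


data = b"object storage!!"
shards = split_into_shards(data, 4)
parity = xor_shards(shards)  # the extra error-correction shard

# Simulate losing the disk that holds shard 2, then rebuild it.
survivors = [s for i, s in enumerate(shards) if i != 2]
rebuilt = xor_shards(survivors + [parity])
print(rebuilt == shards[2])

# Capacity comparison: a hypothetical 10+4 layout vs. triple replication.
print(f"erasure coding 10+4: {usable_fraction(10, 4):.0%} usable")
print(f"triple replication:  {usable_fraction(1, 2):.0%} usable")
```

With ten data shards and four parity shards, about 71% of the raw disk holds data, compared with about 33% for three full copies.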
Managing Unstructured Data
Whenever you’re working with bulk storage, the majority of your data is probably going to be unstructured. “Unstructured data” is a buzzword you hear a lot regarding object storage. It doesn’t mean that the files themselves are unstructured. Certainly, every file has some type of structure. When somebody says that their data is unstructured, they mean that it’s not stored in a database. It’s just a loose collection of files, created by a variety of different applications. A good example is your documents folder.
Object stores help you manage unstructured data by letting you tag files with custom metadata that describe their contents. This allows you to track and index files without the need for external software or databases. All of your data is self-describing, which opens up an array of new possibilities for data analytics. You can index and search your files without any prior knowledge of a file’s internal structure or what applications were used to create it. By giving your files context, you can perform analytics directly on your data.
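To make the tagging idea concrete, here is a minimal in-memory sketch. The class and the metadata fields are hypothetical, and a real object store would persist and index this information for you; the sketch only shows how searching by tags works without opening the files themselves.

```python
class MetadataIndex:
    """Toy object store that keeps custom metadata next to each object."""

    def __init__(self):
        self._objects = {}  # key -> (data, metadata dict)

    def put(self, key, data, metadata=None):
        self._objects[key] = (data, dict(metadata or {}))

    def search(self, **criteria):
        """Return sorted keys whose metadata matches every criterion."""
        return sorted(
            key
            for key, (_, meta) in self._objects.items()
            if all(meta.get(name) == value for name, value in criteria.items())
        )


store = MetadataIndex()
store.put("scan-001.tif", b"...", {"project": "survey", "instrument": "camera-a"})
store.put("scan-002.tif", b"...", {"project": "survey", "instrument": "camera-b"})
store.put("notes.txt", b"...", {"project": "admin"})

# Find everything from one project without reading a single file.
print(store.search(project="survey"))
print(store.search(project="survey", instrument="camera-b"))
```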
The ActiveScale™ Object Storage System
The ActiveScale system from Western Digital is a fully featured, S3-compatible object storage system, designed from the ground up to handle today’s big data storage challenges. It is built for petabyte-scale deployments and uses advanced data storage techniques, like erasure coding, to protect your data and maximize usable disk space.
From the components that make our hard drives reliable, to the expansion of capacity through our leading helium-sealed technology and all the way up the system stack, Western Digital innovates to deliver some of the densest storage systems on the planet. We are in the unique position of being the only vendor that manufactures nearly all of the components throughout the entire storage technology stack. This enables us to offer some of the most aggressive pricing in the industry and lead the charge to help businesses transform themselves.
If you’d like to learn more about the ActiveScale system, visit our website.
Mike McWhorter is a Senior Technologist for Western Digital. He specializes in performance tuning and storage optimization for Western Digital's big data customers. He is involved in testing and benchmarking new applications as well as optimizing them for various types of storage technologies. Previously, Mike worked as a Solutions Architect for the federal sales team, where he was responsible for designing and implementing large-scale distributed systems for the federal government. Mike received a Bachelor’s degree in Computer Science from Longwood University in Virginia.