March 30, 2018 6 min read Data

Deduplication in the Cloud – Getting the Most Out of Hybrid Cloud Workflows

The impact of deduplication on traditional storage efficiency was significant. Now deduplication is expanding to deliver those benefits to cloud environments. In this blog post I’ll look at how our new partnership is bringing the power of deduplication to hybrid and private cloud infrastructure. Let’s begin with the fundamentals and establish some definitions.

Introduction to Data Deduplication

One of the first results when you search for deduplication is the Wikipedia definition. For the sake of brevity: it is a process, also called “Single Instance Storage,” whereby only a single instance of duplicate data is actually stored persistently. The deduplication process can be executed either in-line, as the data enters a target device, or post-process, where the data is first written to the target device and the duplicate data is identified and removed at a later time.

In-line deduplication relies on hash calculations and lookups on the write path, which can impact ingest performance. Its benefit is that less storage is required at the target device. Post-process deduplication does not impact ingest speed, so ingest rates stay fast. However, it generally requires larger target storage capacity, because the data lands in full before the duplicates are removed. The two approaches are often pitted against each other in heated debate.
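To make the in-line idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular product works: the class name `InlineDedupStore`, the fixed 4-byte chunk size and the choice of SHA-256 are all hypothetical, and real systems typically use much larger (often variable-size) chunks and persistent indexes.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny for illustration (an assumption)


class InlineDedupStore:
    """Toy single-instance store: each unique chunk is kept once, keyed by its hash."""

    def __init__(self):
        self.chunks = {}   # chunk hash -> chunk bytes (stored once)
        self.objects = {}  # object name -> ordered list of chunk hashes

    def put(self, name, data):
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # The hash calculation and lookup happen on the write path;
            # this is the in-line cost mentioned above.
            if digest not in self.chunks:
                self.chunks[digest] = chunk
            recipe.append(digest)
        self.objects[name] = recipe

    def get(self, name):
        # Rebuild the object from its list of chunk hashes.
        return b"".join(self.chunks[d] for d in self.objects[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = InlineDedupStore()
store.put("a", b"AAAABBBBAAAA")  # chunks AAAA, BBBB, AAAA: two unique
store.put("b", b"BBBBCCCC")      # BBBB is already stored, only CCCC is new
print(store.stored_bytes())      # 12 bytes stored for 20 bytes written
```

A post-process design would write the raw bytes first and run the same hash-and-lookup pass later, which is why it ingests faster but needs more landing capacity.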

The Cloud, Put it All in the Cloud?

New technologies in the data center are helping to alleviate challenges and allowing businesses to harness the value of data. A large share of IT budgets is being spent on storage systems. The onslaught of data being stored, accessed, analyzed and transformed is driving new, cost-effective, dense storage systems. But for the people responsible for those systems, it can become challenging. The more data that is stored, the more power, cooling and floor space the systems consume. And the staff required to manage a growing amount of hardware can tax most organizations that are trying to operate “lean and mean.” That’s OK, many say, just move it to the public cloud and hand those environmental requirements and staffing complexities to a third party. Cloud, here comes my data…?

Where Did My Hair / Data Go

My first computer used an MFM-encoded 20MB hard drive. And I thought I was hot stuff! Now we have 14TB hard drives, and I realize that I have been doing this for a long time and that I had a full head of hair back then. Fast forward through the technology and you see multiple systems, multiple drives, RAID solutions, and massive capacity in a single disk drive. A single large-capacity drive failure was no longer a case of grabbing your floppy disk or mini-tape backup and an IV of caffeine. RAID-based storage systems soon suffered from the long rebuild times associated with large-capacity disk drives and risked massive data loss. For some, that meant lost business and a potential cascade of problems, including lawsuits and litigation costs.

On Premises Tradeoffs and Benefits

A business or enterprise’s data is part of its intellectual property. Whether that data is stored on premises or in the cloud, there are still data management and cost considerations. Simply storing your data in the public cloud does not relieve you of those cost issues. If you want to access your data in the public cloud, it comes with a cost. If you want better and faster performance, it comes with an even higher cost.

Enterprises are often governed by internal policy as well as industry and governmental requirements, and not all data gets a trip to the public cloud(s). I am not going to cover the various rules and regulations associated with data storage; those who must abide by these policies are all too aware of their implications…

Then Came Object Storage

Traditional solutions could not keep pace with the explosion in data storage. The amount of data being stored is expanding globally; worldwide data generation is expected to reach 163 zettabytes by 2025. Object storage systems provide a highly scalable solution with significant data protection mechanisms built into the hardware, software and protocol. Erasure coding shards (distributes) pieces of the data across many devices in a single namespace; see it as RAID on “performance enhancing drugs.”
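For readers who want a feel for the sharding idea, the sketch below shows the simplest possible erasure code in Python: split the data into k shards and add one XOR parity shard, so any single lost shard can be rebuilt from the survivors. This is a teaching example only. The function names `encode` and `recover` are hypothetical, and production object stores generally use stronger codes that tolerate several simultaneous losses; nothing here describes a specific product’s scheme.

```python
from functools import reduce


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, k):
    """Split data into k equal-size shards plus one XOR parity shard."""
    shard_len = -(-len(data) // k)              # ceiling division
    padded = data.ljust(shard_len * k, b"\0")   # pad so shards are equal size
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = reduce(xor_bytes, shards)
    return shards + [parity]


def recover(shards):
    """Rebuild the one missing shard (marked None) by XOR-ing the survivors."""
    missing = shards.index(None)
    survivors = [s for s in shards if s is not None]
    rebuilt = list(shards)
    rebuilt[missing] = reduce(xor_bytes, survivors)
    return rebuilt


original = b"object-storage"
shards = encode(original, 4)       # 4 data shards + 1 parity, one per "device"
damaged = list(shards)
damaged[2] = None                  # simulate losing one device
restored = recover(damaged)
print(b"".join(restored[:4])[:len(original)] == original)  # True
```

Each shard would live on a different device, so losing one device costs no data and the rebuild only has to regenerate that one shard.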

The ActiveScale Object Storage System

Western Digital’s ActiveScale Object Storage systems benefit from being built by a single vendor that manufactures across the entire technology stack. This not only allows us to innovate at every step, taking advantage of our deep expertise in manufacturing, but also enables us to offer some of the most aggressive pricing in the industry.

ActiveScale is designed to scale up and out in capacity and throughput. Scaling to petabytes while providing extreme data integrity and preventing data loss is essential to our customers. ActiveScale facilitates data-forever, easily keeping up with data growth and delivering on long-term preservation goals with simplified management.

Deduplication in the Cloud

Now let’s get back to deduplication. The impact of deduplication on traditional storage was significant. Enterprises got more longevity out of their storage systems because less redundant data was being written. And the technology had a cascade effect, reducing the overall number of storage systems in the data center. This has a great impact on stabilizing or even decreasing power and cooling costs, which reduces operating expenses. The industry is making incredible efficiency improvements to minimize its costs and environmental impact.

Now, Western Digital and StorReduce bring the power of deduplication to object-storage based private and hybrid cloud infrastructure.

A Powerful Hybrid Cloud Solution

StorReduce sits between your applications and the public cloud or ActiveScale object storage, transparently performing inline data deduplication on up to hundreds of petabytes under a single global namespace using the Amazon Simple Storage Service™ (S3) REST API.

What does that really mean? ActiveScale and StorReduce present a fast and massively scalable target for deduplicated data, which can be replicated between on-premises data centers and public cloud providers such as AWS™. Removing duplicate data means less traffic across leased lines and the ability to extend usable capacity beyond its physical limits.

ActiveScale together with StorReduce presents a modern cloud-based data storage solution and a powerful, cost-effective data protection strategy. With deduplication in a cloud configuration using ActiveScale and StorReduce, businesses can see savings of up to 70% on their primary backup environment compared with backup appliances.

We all want it better, faster and for less money. This partnership delivers on those requirements to help your data thrive.

Learn more:

Blog: Why Object Storage? If you’re new to this technology, this is a great intro to what it’s all about.
On-demand webinar: How to Solve Backup Appliance Sprawl