NOTE: This blog was updated on September 26, 2018 by Earle Philhower III to reflect the latest advancements in Western Digital’s SSD technology.
This is my second post in the “Speeds, Feeds, and Needs” blog series, designed to explain the more technical elements of enterprise storage in terms that are understandable to everyone. My first post discussed the role of latency in storage architectures. In this post I’ll discuss SSD endurance and how this affects your choice of SSD, plus give you some rules of thumb for making the right choice.
Choosing the Right SSD Isn’t Easy
You’ve probably looked at an SSD datasheet and been a little overwhelmed. Choosing the right SSD is a complicated process, after all. You have to pick the right form factor so that the drive will fit in your server. You need to select from three major, incompatible interfaces (SATA, SAS, or NVMe™). You also need to choose the right capacity, of course, anywhere from hundreds of gigabytes to multiple terabytes. That’s all you need to do, right? Wrong.
There’s one more choice you need to make, and it’s a choice you might not have had to make before: the SSD endurance level. SSD endurance is the total amount of data that can be written to an SSD under warranty, often specified in “TBW” or “DWPD” (which we’ll discuss a little later). The physics of SSD endurance are complicated, but the results are simple: SSDs wear out as you write to them. Choose the wrong SSD endurance and you’ll end up either replacing the drive early or overpaying for more endurance than you need.
Not All Flash Is Created Equal
SSD endurance is limited because the NAND flash that powers SSDs has a finite number of “program/erase” (P/E) cycles before it can’t be used anymore. These cycles occur whenever existing data needs to be overwritten in a flash cell. As the industry transitions from Multi Level Cell (MLC) SSDs, which store 2 bits per cell, to Triple Level Cell (TLC) SSDs, which store 3 bits per cell, the available P/E cycles decrease. Fewer cycles mean lower endurance.
Error Correction, Overprovisioning, and Firmware
Thankfully, SSD endurance isn’t set by P/E cycle limits alone. Technology placed around the NAND by the manufacturer can change endurance as well, for better or worse. Western Digital improves SSD endurance with three main technologies: error correction, overprovisioning, and firmware.
Advanced error correction techniques such as HGST’s CellCare™ NAND management technology or SanDisk®’s Guardian Technology™ can help retrieve data from even marginal flash cells and can dramatically extend the NAND cell’s usable lifetime.
Overprovisioning adds extra flash capacity to the SSD. This extra flash is not visible to the user, but the drive can see it and uses it to enhance endurance by managing data more efficiently.
Finally, the program that runs in the SSD, the firmware, can intelligently manage the flash inside of the SSD. The more experience a company has with end-user workloads and with the flash itself (SanDisk, a Western Digital brand, has over thirty years of history in this!), the more intelligence it can embed in this firmware to help maximize endurance.
The SSD Endurance Equation
SSD endurance is commonly described in terms of Drive Writes Per Day (DWPD) for a certain warranty period (typically 3 or 5 years). In other words, if a 1TB SSD is specified for 1 DWPD, it can withstand 1TB of data written to it every day for the warranty period. Alternatively, if a 1TB SSD is specified for 10 DWPD, it can withstand 10TB of data written to it every day for the warranty period.
Another metric used for SSD write endurance is Terabytes Written (TBW), which describes how much data can be written to the SSD over the life of the drive.
Converting between TBW and DWPD is simple:
DWPD to TBW: TBW = Capacity(TB) * DWPD * 365 * Warranty(Years)
TBW to DWPD: DWPD = TBW / (365 * Warranty(Years) * Capacity(TB) )
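If you would rather let a script do the arithmetic, the two conversions above can be sketched in a few lines of Python. The function names and the 1.92TB, 3 DWPD, 5-year drive in the example are made up for illustration and are not taken from any datasheet.

```python
DAYS_PER_YEAR = 365  # same day count used in the formulas above


def dwpd_to_tbw(capacity_tb: float, dwpd: float, warranty_years: float) -> float:
    """Total terabytes that can be written over the warranty period."""
    return capacity_tb * dwpd * DAYS_PER_YEAR * warranty_years


def tbw_to_dwpd(tbw: float, capacity_tb: float, warranty_years: float) -> float:
    """Drive writes per day implied by a TBW rating."""
    return tbw / (DAYS_PER_YEAR * warranty_years * capacity_tb)


if __name__ == "__main__":
    # Hypothetical drive: 1.92TB capacity, 3 DWPD, 5-year warranty.
    tbw = dwpd_to_tbw(1.92, 3, 5)
    print(f"{tbw:,.0f} TBW")
    # Converting back should recover the original DWPD rating.
    print(f"{tbw_to_dwpd(tbw, 1.92, 5):.2f} DWPD")
```

Note that the warranty length appears in both directions, so a TBW figure can only be turned back into DWPD if you know the warranty period it was quoted for.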
“1 DWPD” Doesn’t Equal “1 DWPD”
A common trap that users fall into when looking at SSD datasheets is assuming that “1 DWPD” on one drive means the same as “1 DWPD” on another drive. When SSDs have different capacities, the total amount of data you can write to them can vary dramatically. Take the case of a 15TB, “1 DWPD” SSD and a 1TB, “1 DWPD” SSD, both with a 5-year warranty.
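TBW(15TB) = 15TB * 1 DWPD * 365 Days/Year * 5 Years = 27,375 TBW
TBW(1TB) = 1TB * 1 DWPD * 365 Days/Year * 5 Years = 1,825 TBW
Both drives carry the same “1 DWPD” rating, yet the 15TB drive can absorb 15 times as much written data over its warranty period.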
What Happens When You Get SSD Endurance Wrong
Choosing an SSD with more endurance than you need can often increase the initial cost. However, in some cases a higher-endurance SSD can provide higher write performance than a lower-endurance SSD. So, if your application can take advantage of additional SSD performance, you may want to consider a higher-endurance model.
Choosing an SSD with too little endurance, however, can increase your cost and troubles in the long run. As the total amount of written data goes beyond the warranty endurance, the possibility of data loss and SSD failure increases. The costs and frustration of replacing failed drives or dealing with lost data can add up quickly.
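To get a feel for how quickly an undersized drive can run through its warranty endurance, here is a rough sketch. It assumes a constant daily write rate, and the function name and example numbers are hypothetical. Reaching the rated TBW does not mean the drive fails on that day; it only marks the point where the written data goes beyond the warranty endurance.

```python
def years_until_rated_tbw(rated_tbw: float, daily_writes_tb: float) -> float:
    """Years of service before total writes reach the rated TBW.

    Assumes the workload writes the same amount every day, which real
    workloads rarely do, so treat the result as an estimate only.
    """
    if daily_writes_tb <= 0:
        raise ValueError("daily_writes_tb must be positive")
    return rated_tbw / (daily_writes_tb * 365)


if __name__ == "__main__":
    # Hypothetical case: a drive rated for 1,825 TBW (1TB, 1 DWPD, 5 years)
    # placed under a workload that writes 2TB per day.
    years = years_until_rated_tbw(1825, 2.0)
    print(f"Rated endurance reached after {years:.1f} years")
```

In this made-up case the drive reaches its rated endurance after 2.5 years, halfway through a 5-year warranty period.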
Getting it Right
When you know how much data your application will be writing, the choice of endurance level is straightforward: Determine the average amount of data written per day, multiply it by the number of days a server is in service, and then use that number as a lower bound endurance limit. This number is only a lower bound because it’s prudent to add headroom for unexpected growth.
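The sizing rule above can be sketched as follows. The `headroom` parameter is an assumption added for illustration (a fraction such as 0.25 for 25% extra); how much headroom is prudent depends on your own growth expectations, and the function names and example figures are hypothetical.

```python
def required_tbw(avg_daily_writes_tb: float, service_years: float,
                 headroom: float = 0.0) -> float:
    """Lower-bound TBW: average daily writes times days in service.

    headroom is an optional safety margin expressed as a fraction,
    e.g. 0.25 adds 25% on top of the measured average.
    """
    days_in_service = 365 * service_years
    return avg_daily_writes_tb * days_in_service * (1 + headroom)


def required_dwpd(avg_daily_writes_tb: float, capacity_tb: float,
                  headroom: float = 0.0) -> float:
    """The same requirement expressed as drive writes per day."""
    return avg_daily_writes_tb * (1 + headroom) / capacity_tb


if __name__ == "__main__":
    # Hypothetical workload: 0.8TB written per day, 5 years in service,
    # 25% headroom, and a 1.6TB drive under consideration.
    print(f"Need at least {required_tbw(0.8, 5, 0.25):,.0f} TBW")
    print(f"Need at least {required_dwpd(0.8, 1.6, 0.25):.3f} DWPD")
```

For this example workload the result is roughly 1,825 TB written over the service life, or about 0.625 DWPD on the 1.6TB drive.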
The SSD Endurance Cheat Sheet
When workload measurements aren’t readily available, there are some rules of thumb that can be used instead. The following table contains a list of use cases and a generalized range of DWPD, adapted from Top Considerations for Enterprise SSDs.
Because these are rules-of-thumb only (from conversations with our customers and product teams), they should be used simply as guides to begin conversations with your vendor when choosing an SSD for your own application.
| Use Case | Description | Approx. DWPD |
| Boot Drive | Server boot drive. Updated only periodically. Logs and all permanent data stored elsewhere. | 0.1 ~ 1.0 |
| Content Distribution | Accelerating CDN front ends. Media migrated depending upon popularity. | 0.5 ~ 2.0 |
| Surveillance | Streaming writes from multiple cameras, operating continually, overwriting the drive on a periodic basis. | Cams * BW |
| Virtualization and Containers | Tier-0 storage for containers and VMs in a hyperconverged system. SSDs provide all local storage for the cluster. | 1.0 ~ 3.0 |
| OLTP Database | Data intensive workloads. Frequent updates to database logs and data files, often thousands of times per second. | 3.0+ |
| High Performance Caching | Accelerate local hard drives. Some of the highest write workloads possible. | 3.0++ |
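The surveillance row gives a formula rather than a range, so it may help to work one through. The sketch below reads “Cams * BW” as the number of cameras multiplied by each camera’s sustained write bandwidth, which is an interpretation rather than something the table spells out. It also assumes continuous recording and decimal units (1TB = 1,000,000MB); the function name and camera figures are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MB_PER_TB = 1_000_000  # decimal units, as used on drive datasheets


def surveillance_dwpd(cameras: int, mb_per_sec_per_camera: float,
                      capacity_tb: float) -> float:
    """Estimate DWPD for cameras streaming continuously to one drive."""
    daily_writes_mb = cameras * mb_per_sec_per_camera * SECONDS_PER_DAY
    daily_writes_tb = daily_writes_mb / MB_PER_TB
    return daily_writes_tb / capacity_tb


if __name__ == "__main__":
    # Hypothetical setup: 16 cameras, each writing 1MB/s, to a 4TB SSD.
    print(f"{surveillance_dwpd(16, 1.0, 4.0):.2f} DWPD")
```

With these made-up numbers the cameras write about 1.38TB per day, or roughly 0.35 DWPD on the 4TB drive. Doubling the camera count or halving the capacity doubles the DWPD requirement.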
Selecting the correct SSD for your application requires choosing the appropriate endurance, especially with today’s newer flash technologies. Taking the time to examine the data sheets and your workloads to select the right endurance for your SSD will maximize its lifetime and minimize your purchase costs and operating expenses.