SQL Server OLTP Consolidation – How to Get the Right S2D Caching Tier

In one of my previous posts, Test Results for an All-Flash Spaces Direct Configuration, I shared test results from a non-production Storage Spaces Direct configuration and discussed implications for the storage configuration to host virtualized workloads.

Now let’s look at SQL Server OLTP consolidation and some considerations when sizing a production-capable Storage Spaces Direct configuration that reflects Microsoft’s recommendations.

Caution! This blog post is a thought exercise, pulling together information from a variety of sources. Performance is projected based on datasheets. We have not tested this configuration to validate the projected performance levels.

SQL Server OLTP Consolidation

Consolidating several or many SQL Server workloads onto fewer hosts has been going on for years, with workload density per host increasing as servers and storage have become more capable.

Imagine the next wave of SQL Server OLTP consolidation in your company, to run more workloads on fewer hosts than you use today. That’s easy to imagine today, with SQL Server 2008/R2 reaching End of Support on July 9, 2019 – a large percentage of SQL Server deployments run SQL Server 2008, and something will have to be done with them.

From experience you know that when you add more virtualized applications to a server, it quickly hits a storage performance bottleneck that effectively limits the number of workloads that server can support. This is caused by the I/O Blender Effect, which I wrote about in an earlier post in this series.

How to plan and deploy the Storage Spaces Direct storage environment for a successful SQL Server OLTP consolidation project is the focus of this blog post.

Let’s analyze a production-ready configuration that follows Microsoft’s guidance, instead of the test configuration described in my previous blog post. I’ll use Ultrastar® SSDs as examples so we can look at specific device characteristics and consider trade-offs. (All figures are from Ultrastar datasheets at the time this blog post was published.)

HGST Ultrastar SN200 and Ultrastar SS300

The configuration we’ll discuss:

  • 4-node Storage Spaces Direct cluster, hyperconverged configuration
  • Caching: two 800GB[1] Ultrastar SN200 NVMe™ SSDs per node
    • 2.5” U.2, MLC NAND, 3 DW/D
  • Capacity: four 960GB Ultrastar SS300 SAS SSDs per node
    • 2.5” U.2, TLC NAND, 1 DW/D

The “hyperconverged” configuration means the workloads run on the same servers that implement Storage Spaces Direct, a very popular configuration.

Storage Spaces Direct Caching Tier

Microsoft requires at least two cache drives per node for resilience, so if one fails, the other can carry the load until a replacement drive is installed. Microsoft’s guidance is to add capacity drives in multiples of the number of cache tier drives, so the minimum is four capacity drives.

Across our 4-node cluster, the eight NVMe SSDs can deliver up to a combined 1,600,000 4KB write IOPS, and even using the smallest-capacity 800GB drive, we get 6.4TB of cache tier capacity.
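
To make that arithmetic explicit, here’s a minimal Python sketch of the cluster-level cache tier math. The per-drive figure of roughly 200,000 4KB write IOPS is simply the combined number above divided by eight drives; check the current SN200 datasheet for your exact capacity point.

# Cluster-level cache tier sizing sketch.
# Assumption: ~200,000 4KB random-write IOPS per 800GB Ultrastar SN200,
# which is the per-drive figure implied by the combined number above.
NODES = 4
CACHE_DRIVES_PER_NODE = 2          # Microsoft minimum for resilience
CAPACITY_DRIVES_PER_NODE = 4       # a multiple of the cache drive count
CACHE_DRIVE_GB = 800
CACHE_DRIVE_WRITE_IOPS = 200_000   # assumed per-drive 4KB write IOPS

cache_drives = NODES * CACHE_DRIVES_PER_NODE
total_cache_gb = cache_drives * CACHE_DRIVE_GB
total_write_iops = cache_drives * CACHE_DRIVE_WRITE_IOPS

# Sanity-check Microsoft's guidance: capacity drives per node should be
# a multiple of the cache drives per node.
assert CAPACITY_DRIVES_PER_NODE % CACHE_DRIVES_PER_NODE == 0

print(f"Cache drives in cluster : {cache_drives}")
print(f"Cache tier capacity     : {total_cache_gb / 1000:.1f} TB")  # 6.4 TB
print(f"Combined 4KB write IOPS : {total_write_iops:,}")            # 1,600,000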

Absorbing OLTP’s Small-Block Random Writes

Each OLTP workload brings its share of small-block random writes. Consolidating many OLTP workloads will bring many small-block random writes, so you’ll want NVMe SSDs to absorb all those writes.

The total volume of writes (number of writes * size of writes) may not be particularly large, so you probably won’t need maximum-capacity NVMe SSDs; but you’ll want to maximize the write bandwidth available, to maximize the number of workloads you can support.

To get more bandwidth, you can add more SSDs. Each SSD has a controller, and the performance of that controller can be one of the limiting factors for bandwidth, so more SSDs means more controllers and more available bandwidth.

Using more SSDs – but at lower capacity points – can give you more controllers and bandwidth, as shown in the table below.

Qty   Product                           Total Capacity   Combined Max Bandwidth
2     Ultrastar SN200 NVMe SSD 1.6TB    3.2TB            4,200 MB/s (Seq W, 128kB)
4     Ultrastar SN200 NVMe SSD 800GB    3.2TB            8,400 MB/s (Seq W, 128kB)

More SSDs at lower capacity points can give you more bandwidth

Adding lower-capacity drives, and thus more controllers, is how you’d adjust this configuration to provide more bandwidth if you needed it.

Comparing the cost at a national online seller, the four smaller drives cost about 28% more than the two larger drives, but you get 2X the max bandwidth.
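
As a rough way to weigh that trade-off, here’s a small Python sketch comparing bandwidth per dollar for the two options. Only the roughly 28% price premium and the bandwidth figures come from above; the absolute prices are hypothetical placeholders, not quotes.

# Compare two cache tier options: fewer large drives vs. more small drives.
# Bandwidth figures are the combined sequential-write numbers from the table;
# prices are hypothetical placeholders, only the ~28% premium comes from the post.
options = {
    "2 x SN200 1.6TB": {"bandwidth_mbps": 4_200, "price_usd": 3_000},  # hypothetical price
    "4 x SN200 800GB": {"bandwidth_mbps": 8_400, "price_usd": 3_840},  # ~28% more, hypothetical
}

for name, o in options.items():
    mbps_per_dollar = o["bandwidth_mbps"] / o["price_usd"]
    print(f"{name}: {o['bandwidth_mbps']:,} MB/s for ${o['price_usd']:,} "
          f"-> {mbps_per_dollar:.2f} MB/s per $")
# Doubling the drive (and controller) count doubles bandwidth for ~1.28x the cost.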

Understand Your Working Set and Caching Tier Sizing

Before we order those NVMe drives, let’s consider the “working set”. Each workload has a set of active data that is frequently accessed, and that workload’s performance depends on whether that active data can be accessed quickly.

Optimally, SQL Server would have enough system memory available to keep your entire database in memory for the highest possible performance. When consolidating SQL Server workloads, that’s unlikely to be your situation. Instead, SQL Server will keep as many active pages from as many database workloads in system memory as it can; the remainder of the active pages will be “paged in” when needed, from wherever they’re stored.

But where is the database stored? In our all-flash configuration, the database is stored (“persisted”) on the capacity tier drives; except for recently-written data, which you’ll recall goes first to the caching tier drives before being destaged to the capacity tier.

For a single database, consider the portion of that workload’s active data that’s been recently modified. The modified data has been written to the NVMe drives of the caching tier, and at some point Storage Spaces Direct will destage it to the SAS SSDs.

Before destaging occurs, the workload may need SQL Server to read that data back into system memory; after all, we’re talking about the workload’s most active data. Before destaging, that data can be read from the NVMe cache tier devices, which are over 50% faster (bandwidth) than the SAS SSDs in the capacity tier.

To support the many OLTP workloads you want to consolidate, for best results your system needs to provide enough NVMe drive capacity in your caching tier to hold the most active data of all those workloads.

Let’s do some math. Assume we’ll consolidate ten workloads, each with a 400GB database; that’s 4,000GB in total. Twenty percent is a typical working set, and 20% of 4,000GB is 800GB. (A quick sketch of this calculation follows the table below.)

Workloads   Database Size (each)   Database Size (total)   Working Set (20%)
10          400GB                  4,000GB                 800GB

Workloads vs. Working Set Size
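
Here’s the quick sketch of that calculation. The 400GB database size and 20% working-set ratio are the assumptions used in this example; substitute your own measured values.

# Working set sizing sketch for the 10-workload example above.
# 400GB per database and a 20% working set are assumptions from this post.
WORKLOADS = 10
DB_SIZE_GB = 400
WORKING_SET_RATIO = 0.20
CLUSTER_CACHE_GB = 6_400           # 8 x 800GB SN200 across the 4-node cluster

total_db_gb = WORKLOADS * DB_SIZE_GB
working_set_gb = total_db_gb * WORKING_SET_RATIO

print(f"Total database size : {total_db_gb:,} GB")        # 4,000 GB
print(f"Working set (20%)   : {working_set_gb:,.0f} GB")  # 800 GB
print(f"Cache tier capacity : {CLUSTER_CACHE_GB:,} GB")   # 6,400 GB
print(f"Fits in cache?      : {working_set_gb <= CLUSTER_CACHE_GB}")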

Our Storage Spaces Direct cluster has 6.4TB of NVMe cache, but our workloads need only 800GB – what do we do?

Easy Math: Consolidate More Workloads, Save More Money!

A couple of years ago, a SanDisk whitepaper demonstrated that storage performance was typically the bottleneck that limited how many workloads a host could support. That whitepaper also showed that using fast flash storage in the host not only allowed you to run up to 3x more virtualized workloads on the same host, but that those workloads would all run 2x faster.

SQL Server and Windows Server are now licensed per-core, so the more workloads you can run on each host, the less you pay for licenses. You minimize what you pay for software licenses when you maximize the work done by each host.

That whitepaper demonstrated that investing in fast flash storage enabled the same host to run 3x more workloads. Three times more workloads per host means you can avoid buying two additional hosts to run those workloads – that can save several $100Ks in software licensing cost, with a 22x ROI.
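
To see why per-core licensing rewards consolidation, here’s an illustrative Python sketch. The core count and per-core prices are hypothetical placeholders, not Microsoft list prices; the point is only that avoiding two additional hosts avoids licensing their cores.

# Illustrative per-core licensing sketch: consolidating 3x more workloads per host
# means two fewer hosts to license. Core counts and per-core prices below are
# hypothetical placeholders, not actual Microsoft list prices.
CORES_PER_HOST = 24
SQL_PER_CORE = 7_000      # hypothetical
WINDOWS_PER_CORE = 400    # hypothetical
HOSTS_AVOIDED = 2         # from the 3x-more-workloads-per-host argument above

license_cost_per_host = CORES_PER_HOST * (SQL_PER_CORE + WINDOWS_PER_CORE)
savings = HOSTS_AVOIDED * license_cost_per_host
print(f"License cost per avoided host : ${license_cost_per_host:,}")
print(f"Licensing avoided ({HOSTS_AVOIDED} hosts) : ${savings:,}")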

Returning to our caching tier sizing discussion, let’s increase the number of workloads to be consolidated by 3x, as the SanDisk study showed. That means we’ll consolidate thirty OLTP workloads instead of ten. The total database size will be 12,000GB (400GB * 30 = 12,000GB), and a 20% working set will be 2,400GB or 2.4TB.

Workloads   Database Size (each)   Database Size (total)   Working Set (20%)
30          400GB                  12,000GB                2,400GB
50          400GB                  20,000GB                4,000GB
80          400GB                  32,000GB                6,400GB

More Workloads vs. Working Set Size

With 6.4TB of cache tier capacity across the 4-node cluster, there’s plenty of room to consolidate even more workloads, or workloads with substantially larger working sets.
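
Reusing the assumptions from the earlier sketch (400GB per database, 20% working set), a quick loop shows how much of the 6.4TB cache tier each of the workload counts in the table above would consume:

# Extend the earlier working-set sketch to the workload counts in the table above.
DB_SIZE_GB = 400
WORKING_SET_RATIO = 0.20
CLUSTER_CACHE_GB = 6_400

for workloads in (30, 50, 80):
    working_set_gb = workloads * DB_SIZE_GB * WORKING_SET_RATIO
    headroom_gb = CLUSTER_CACHE_GB - working_set_gb
    print(f"{workloads} workloads: working set {working_set_gb:,.0f} GB, "
          f"cache headroom {headroom_gb:,.0f} GB")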

Caution! This blog post is a thought exercise; we have not tested this configuration to validate the projected performance levels.

Up Next – the Capacity Tier

In my next blog post, I’ll look at what you need to know about the Storage Spaces Direct capacity tier, destaging, and reserve capacity to make your SQL Server OLTP consolidation a success.

Thanks to Peter Plamondon for all of his wonderful work and his many contributions to this and other articles.

[1] One megabyte (MB) is equal to one million bytes, one gigabyte (GB) is equal to 1,000MB (one billion bytes) and one terabyte (TB) is equal to 1,000GB (one trillion bytes) when referring to solid-state capacity. Accessible capacity will vary from the stated capacity due to formatting and partitioning of the drive, operating software, and other factors.
