By Shailendra Tripathi, Fellow, Filesystem Development Engineering, and Ganesh Balabharathi, Sr. Technologist, Performance Engineering
Our IntelliFlash™ N-Series NVMe™ storage array leverages NVDIMM NTB-based mirroring for high availability (HA). This is an important distinction from some of the other solutions currently available on the market. In this blog we'll explain the technical reasoning behind our architectural approach and how it delivers superior write throughput and space efficiency.
NVDIMM in Journaling Operations
Journaling is a common method in storage systems for achieving crash consistency. Any persistent media can be used for journaling, and NVDIMM is a popular choice in many storage arrays due to its low-latency access.
Update operations are first persisted in the NVDIMM and subsequently applied to the actual storage media asynchronously. As a result, the performance of the storage system, and of journaled update operations in particular, is tied directly to NVDIMM performance.
To achieve crash consistency, highly available systems need access to the journal even when one node in the system fails. This allows the surviving node to bring up the storage system without any loss of data integrity or consistency. In highly available systems, all committed, client-acknowledged operations persisted in the journal are applied during replay or recovery. Hence, the journal must be available to both nodes at all times.
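The journal-then-apply flow described above can be sketched in a few lines. This is a toy illustration only, not IntelliFlash code; the class and method names are invented for the example:

```python
# Toy sketch of journaled writes with crash recovery.
# JournaledStore and its fields are illustrative, not product internals.

class JournaledStore:
    def __init__(self):
        self.journal = []   # stands in for the low-latency NVDIMM journal
        self.storage = {}   # stands in for the backing flash media

    def write(self, key, value):
        # 1. Persist the operation to the journal first.
        self.journal.append((key, value))
        # 2. Only then acknowledge the client; the write to the
        #    backing media happens asynchronously later.
        return "ack"

    def apply_async(self):
        # Asynchronous apply: drain journal entries to the real media.
        while self.journal:
            key, value = self.journal.pop(0)
            self.storage[key] = value

    def replay(self):
        # Recovery after a crash: every acknowledged (journaled)
        # operation is re-applied, so no committed write is lost.
        self.apply_async()
```

Because the client is acknowledged only after the journal write completes, a replay after a crash recovers every acknowledged operation.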
Shared Device vs. Mirroring
There are several approaches to journaling in a highly available system. The journal may live on a shared, dual-ported device accessible from both nodes; this shared device can be either a low-latency flash device or an NVDIMM. Alternatively, each node may own its journal independently, with each operation mirrored to the other node before the response is acknowledged back to the client.
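In the second (mirrored) model, an operation becomes durable on both nodes before the client ever sees an acknowledgement. A toy sketch, with all names invented for illustration:

```python
# Toy sketch of a mirrored-journal commit path.
# Node / MirroredJournal are illustrative, not product internals.

class Node:
    def __init__(self, name):
        self.name = name
        self.journal = []   # this node's local NVDIMM journal

class MirroredJournal:
    def __init__(self, local, peer):
        self.local = local
        self.peer = peer

    def commit(self, op):
        self.local.journal.append(op)  # persist in the local NVDIMM
        self.peer.journal.append(op)   # mirror over the interconnect
        return "ack"                   # only now acknowledge the client
```

If the local node then fails, the peer's copy of the journal is complete, so the surviving node can replay every acknowledged operation.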
For our IntelliFlash N-Series products, our extreme-performance NVMe arrays, we chose the latter model, in which the NVDIMMs are mirrored to the other node via a Non-Transparent Bridge (NTB). The NVDIMM usage and update I/O flow is shown in the image below. The red arrow originating at the NFS protocol layer traces the I/O path before the operation is acknowledged to the client.
Why NVDIMM in an NTB Model?
An obvious question arises: why choose NVDIMM in an NTB model when NVDIMMs could instead be placed in drive slots in the storage bay as dual-ported devices? That would eliminate the NTB complexity involved in persisting operations. The answer lies in current hardware capabilities and in the overall efficiency of the system.
The Drive Slot Math
The underlying platform of the IntelliFlash N-Series has 48 PCIe Gen3 lanes. If we consider the option of using dual-ported drives, each drive is connected by 2 lanes per controller (4 lanes total per drive). Each PCIe Gen3 lane provides a theoretical maximum of 985 MB/s, or ~1 GB/s (a single Gen3 lane runs at 8 GT/s with 128b/130b encoding). Hence, each drive can provide at most ~2 GB/s. When NVDIMM is used as a journal device, the data must be mirrored so that consistency survives a drive failure. That means that for every ~2 GB/s of journal throughput, 2 drive slots are required.
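The per-lane and per-drive figures above are easy to reproduce (back-of-the-envelope arithmetic only):

```python
# PCIe Gen3: 8 GT/s per lane with 128b/130b encoding.
per_lane_mbs = 8000 * 128 / 130 / 8   # ≈ 984.6 MB/s per lane
per_drive_mbs = 2 * per_lane_mbs      # 2 lanes per controller → ~2 GB/s per drive

# A mirrored NVDIMM journal therefore consumes 2 drive slots
# for every ~2 GB/s of journal throughput.
slots_per_2gbs = 2
```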
Now let’s compare the NTB option. The current shipping N-Series system runs NTB over PCIe Gen3 x8 lanes, which allows a theoretical maximum of ~8 GB/s (8 × 985 MB/s) in a single direction. Achieving equivalent performance with dual-ported drives would require 8 drive slots. In a typical 24-drive 2U NVMe system, that means one-third less storage capacity available.
Furthermore, the application-level performance (for workloads like Oracle) achieved over NTB is about 5.1 GB/s one way and 8 GB/s in an active/active model on the N-Series array. Hence, even for similar performance levels, at least 6 NVDIMM drive slots would be required. Moreover, the PCIe x8 NTB link is a limitation only of the current-generation N-Series systems; it can be extended to PCIe x16 lanes. That would raise the maximum performance profile to ~16 GB/s, and if the application-level numbers extrapolate linearly, they could reach ~10 GB/s. In other words, it would require 10-16 drive slots to achieve a similar level of performance.
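The slot comparison works out as follows (the same back-of-the-envelope arithmetic; the x16 line is a hypothetical future configuration, not a shipping product):

```python
per_lane_mbs = 8000 * 128 / 130 / 8        # ≈ 984.6 MB/s per PCIe Gen3 lane
per_drive_mbs = 2 * per_lane_mbs           # ~2 GB/s per dual-ported drive

ntb_x8_mbs = 8 * per_lane_mbs              # x8 NTB ≈ 7.9 GB/s one way
# Matching that bandwidth with drives, doubled because the
# journal itself must be mirrored across drive pairs:
slots_x8 = 2 * round(ntb_x8_mbs / per_drive_mbs)          # 8 slots of a 24-bay chassis

# A hypothetical x16 NTB link would double the requirement:
slots_x16 = 2 * round(16 * per_lane_mbs / per_drive_mbs)  # 16 slots
```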
Looking at Future Architectures
With PCIe Gen4-based systems arriving, the picture improves for drives, as each lane can deliver up to ~2 GB/s. However, it would still not be a space-efficient model, and the NTB lanes will likely see a similar upgrade. Hence, NVDIMM NTB-based mirroring will continue to be the better choice. Despite its additional complexity, the NTB-based mirroring model gives HA storage systems superior space efficiency and performance given the underlying hardware limitations, making it the better architecture for the foreseeable future.
Shailendra has over 18 years of experience in file and storage systems development. He leads file system development for WDC.