How Engineers Turned 10,000 Devices into an Analytics Goldmine

Developing new flash memory storage devices is a whirlwind of rapid coding sprints and constant testing. System engineers like Alexandra Bauche, senior manager of systems design engineering at Western Digital, sit at the heart of this storm, where hardware and software come together.

Bauche has always been interested in how things interact: how components like NAND flash memory chips, a controller, and firmware affect one another, and how they can be optimized within an SSD.

But in 2018, while working in validation to test new firmware sprints, she found the scope of her work very limiting. “I wasn’t interested if the latest code passed or failed my test. I wanted to understand what was happening in the system,” she said. 

One particular testing function caught her attention: it allowed her to see all the data streaming from a device in real time. Bauche thought that if she could collect and analyze this data, she could uncover other processes happening within the system.

By mid-2018, she left the validation team to lead a new effort to explore these drive data streams.

Big data woes

At first, Bauche wrote her own Python scripts to collect, parse, and analyze the data. But after a few months of experimenting, she realized she would need much more data and solid processing infrastructure to get the insights she was after. 
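
The article doesn’t reproduce those early scripts, but the collect-parse-analyze loop she describes might look something like this minimal Python sketch. The log path, record format, and metric name here are hypothetical stand-ins, not Western Digital’s actual telemetry schema.

```python
import json
import statistics

def parse_records(path):
    """Yield one telemetry record per line of a newline-delimited
    JSON capture, skipping partial or malformed lines."""
    with open(path) as stream:
        for line in stream:
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # live capture streams often end mid-record

def summarize(records, metric="read_latency_us"):
    """Pull a single (hypothetical) metric out of every record and
    compute simple distribution statistics over it."""
    values = [r[metric] for r in records if metric in r]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "p99": statistics.quantiles(values, n=100)[98],
    }

if __name__ == "__main__":
    # "drive_telemetry.log" is an illustrative file name.
    print(summarize(parse_records("drive_telemetry.log")))
```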

With her manager’s support, she turned to Anshuman Singh, the director of software development engineering at Western Digital. 

Singh leads a team of 50 engineers who build applications and data tools to power product engineering. Bauche’s request seemed like a straightforward ask for his seasoned engineers. But getting to the insights Bauche was after proved far more challenging.

Western Digital’s product engineering operations are massive. There can be 10,000 devices under simultaneous testing, each pumping huge amounts of data across several global locations.

[Image: collage of data graphs, an NVMe SSD, and engineers]

Some of the data had to travel nearly 10,000 miles to a central repository. Multiply that round trip by thousands of simultaneous operations, and the latency piled up.
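
A back-of-the-envelope estimate (ours, not a measurement from Western Digital’s network) shows why distance alone hurts: signals in optical fiber propagate at roughly two-thirds the speed of light, which puts a hard physical floor under every round trip.

```python
# Physical latency floor for a ~10,000-mile fiber path. Signals in
# optical fiber travel at roughly 2/3 the speed of light (~200,000 km/s).
distance_km = 10_000 * 1.609        # ~16,090 km one way
fiber_speed_km_s = 200_000
one_way_ms = distance_km / fiber_speed_km_s * 1000
print(f"one way: {one_way_ms:.0f} ms, round trip: {2 * one_way_ms:.0f} ms")
# -> one way: 80 ms, round trip: 161 ms, before any queuing,
#    retransmissions, or server processing time is added.
```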

But there were other perplexing problems. Singh’s team was using Western Digital’s fastest NVMe SSDs, yet systems couldn’t take advantage of the devices’ data speeds. 

Singh had to keep adding servers and leave those NVMe drives half empty so processing wouldn’t stall. The project was on an expensive trajectory with lackluster results.

A novel approach  

Singh didn’t know it at the time, but he was up against a much more daunting problem: the end of Moore’s Law and the slowing of CPU performance gains.

It turned out that when running databases, a CPU doesn’t just crunch data. It also needs to do a lot of computation associated with how data is laid out on a storage device.

When Singh combined a multi-petabyte database with the speed at which NVMe SSDs read and write data, the CPU was so busy with storage computation that it had little muscle left for analytics.
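
To make the contention concrete, here is a toy illustration (ours, not a model of Western Digital’s stack). The sort-merge work below stands in for what LSM-tree storage engines such as RocksDB do during compaction; on a conventional architecture, it burns the same CPU seconds the analytics queries need.

```python
import random
import time

def storage_engine_work(n=500_000):
    """Stand-in for compaction: sort two runs of random keys, then
    merge them, the core loop of an LSM-tree storage engine."""
    a = sorted(random.randbytes(8) for _ in range(n))
    b = sorted(random.randbytes(8) for _ in range(n))
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(b[j]); j += 1
    return merged + a[i:] + b[j:]

def analytics_work(n=500_000):
    """Stand-in for the computation the engineer actually cares about."""
    return sum(x * x for x in range(n))

for label, task in [("storage engine", storage_engine_work),
                    ("analytics", analytics_work)]:
    start = time.process_time()
    task()
    print(f"{label}: {time.process_time() - start:.2f} s of CPU time")
```

Every CPU second the first task consumes is a second the second task has to wait for; at multi-petabyte scale, that imbalance is what Singh was seeing.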

Fortunately, Singh is an engineer with an insatiable curiosity. Around the same time, he stumbled upon a startup called Pliops while attending a Western Digital Capital event in San Jose. 

Pliops developed a specialized processor called the Extreme Data Processor (XDP). The device isn’t meant to replace the CPU. Instead, it complements it by offloading and accelerating storage engine functions (those functions responsible for how data is laid out on a storage device). 

“The CPU had more bytes than it could chew…”

Essentially, by using specialized hardware and novel data shaping algorithms, the XDP becomes the system’s “storage brain,” freeing the CPU to do more analytics computation.
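
The story doesn’t describe Pliops’s programming interface, so what follows is only a conceptual sketch of the offload pattern. The application keeps talking to a plain key-value interface; everything beneath it (indexing, placement, compaction) can live either in host software or on an accelerator. The `device.submit` handle is hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Optional

class StorageEngine(ABC):
    """The contract the application sees; it never changes."""
    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...
    @abstractmethod
    def get(self, key: bytes) -> Optional[bytes]: ...

class HostSoftwareEngine(StorageEngine):
    """Baseline: every layout decision costs host CPU cycles."""
    def __init__(self):
        self._index = {}
    def put(self, key, value):
        # Real engines also sort, compact, and compress here.
        self._index[key] = value
    def get(self, key):
        return self._index.get(key)

class AcceleratorBackedEngine(StorageEngine):
    """Same interface, but layout work is delegated to dedicated
    hardware; the host CPU only issues commands and reads results."""
    def __init__(self, device):
        self._device = device  # hypothetical handle to an accelerator
    def put(self, key, value):
        self._device.submit("put", key, value)
    def get(self, key):
        return self._device.submit("get", key)
```

Because the interface stays the same, the engine underneath can be swapped without rewriting the applications on top, a property that matters in an operation where any data downtime would be detrimental.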

Singh immediately recognized the potential. In September 2020, he met with Ofer Frishman, the senior director of architecture at Pliops, and introduced him to the project.

Edge that data

Frishman was impressed by the scale of data driving Western Digital’s product innovation. But he also saw Singh’s performance issues as emblematic of a widespread problem in the industry. 

“Large data volumes like Western Digital’s have exposed the inefficiency of compute in legacy architecture,” said Frishman. “Moore’s Law has slowed dramatically. The CPU is overloaded, while NVMe SSDs are underutilized. This problem will only get worse as SSDs become faster, and this is exactly what we aim to solve,” he said. 

For the next few months, Singh and Frishman worked on building an edge architecture, powered by Pliops XDP, to process the streaming data in Western Digital product labs. If all the data didn’t need to squeeze through the network, engineers could potentially access more of it faster.
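
A hedged sketch of that edge idea: each lab site reduces raw telemetry to compact per-device summaries before anything crosses the WAN, so only a fraction of the volume travels. The record fields and the `send` callback are illustrative, not Western Digital’s schema.

```python
from collections import defaultdict

def aggregate_at_edge(raw_records):
    """Roll raw per-device records up into one running summary per
    (device, metric) pair, entirely inside the local lab."""
    summaries = defaultdict(lambda: {"count": 0, "total": 0.0, "peak": 0.0})
    for rec in raw_records:
        s = summaries[(rec["device_id"], rec["metric"])]
        s["count"] += 1
        s["total"] += rec["value"]
        s["peak"] = max(s["peak"], rec["value"])
    return summaries

def ship_to_central(summaries, send):
    """Forward only the rollups; raw records stay at the edge."""
    for (device_id, metric), s in summaries.items():
        send({
            "device_id": device_id,
            "metric": metric,
            "mean": s["total"] / s["count"],
            "peak": s["peak"],
            "samples": s["count"],
        })
```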

For Singh, the prospects were clear. But changing data architectures is a tricky operation. Product engineering is fast-moving and rigorous. Any data downtime would be detrimental, and there’s little room for error. 

It started with a hunch

By April 2021, a year after discovering Pliops, Singh had concluded a meticulous evaluation: the XDP was a perfect fit for the edge approach, and it was put into production.

With the XDP unleashing CPU and NVMe storage resources, Bauche and other product engineers could suddenly crunch months of data rather than just a week’s worth. It was the freeway to innovation Bauche needed to uncover those hidden patterns and trends. 

“We weren’t just able to uncover anomalies we didn’t even know existed,” said Bauche about the project’s success. “We could do so much earlier in the process, already in the very first cycles.” 

Nearly four years since those first Python scripts, Bauche’s hunch evolved into data prowess. Her team’s system insights, powered by Singh’s robust platform and Pliops’s trailblazing XDP, have been instrumental in making Western Digital’s flash memory devices more reliable and keeping the company’s product engineering on the cutting edge.
