How AI is Reshaping the World’s Data Systems 

It seems every corner of our lives is starting to feel the “AI effect.” Virtually no industry is left untouched by AI’s growing influence. Yet, in one domain, AI’s irreversible impact is increasingly evident: the data infrastructure that fuels it.  

Western Digital talked with four experts working at the heart of AI’s hardware and software revolution to understand AI’s impact on the future of data systems. In four interviews, they shared how AI is rewriting the rules of the data center.  

Suddenly, AI was everywhere 

For the last decade, AI has been largely imperceptible, concealed in practical technologies like navigation systems, recommendation engines, and industrial automation.  

The introduction of ChatGPT—hailed as the fastest-growing consumer application in history—brought AI into our hands and into the spotlight. The sensation of turning any idea into an image by simply typing its description—e.g., a sea otter in the style of Girl with a Pearl Earring—was breathtaking. But before we even caught our breath, complicated tasks—coding, photo editing, predicting molecular structures—seemed to become as effortless as ordering a latte. 

Certainly, AI is still far from flawless, and many have highlighted its potential risks. But it is no less a Promethean moment, one after which nothing will be the same.

“We’re really seeing the transition from explicit programming to implicit programming, where you don’t necessarily need to be a computing expert to make something profoundly capable,” Robert Ober, chief platform architect of NVIDIA’s data center products, said in an interview.  

Ober, who has a background in processor architecture, joined NVIDIA nearly a decade ago. He saw what deep learning could do and believed it would change computer architecture forever. Yet even he admits that he dramatically underestimated AI’s impact. 

“If earlier AIs were about identifying or recommending, what we’ve pivoted to very quietly is synthesis,” he said. “We’re able to create things—texts, mechanical design, new drugs, even processor architecture. We are in the baby steps of a whole new industrial revolution. Any place where there’s a processor, there will be AI. Everything we touch and everything we do is going to be impacted by AI.”  

Nearly every week brings a headline about a new model far more capable than its predecessor, or another breakthrough in the field. It’s hard to keep pace, and even harder if you’re trying to build the hardware to support it.

Scaling like never before 

At the heart of the AI revolution, deep under the hood of its machines, are GPUs, whose architecture perfectly suits the matrix-heavy functions required by neural networks. Over the past decade, GPU clusters steadily grew from the tens to the hundreds and into the thousands, establishing themselves as the computational backbone of modern AI.  

In the pursuit of ever-greater intelligence, companies are racing to build even larger computing capacities. Meta, for example, recently announced its ambitious roadmap to build out a 350,000 GPU portfolio by the end of this year. 

Ober, who has been tracking the capabilities of AI and the hardware needed to power it, notes that over the last decade, AI clusters and model complexity have nearly doubled every month. Connecting those dots, Ober is starting to think about a cluster of a million GPUs. 

While hard to fathom, scale has been the engine of the industry—the effectiveness of AI models is directly connected to the availability of vast data and computational resources. 

In a seminal 2009 paper, The Unreasonable Effectiveness of Data, three Google researchers highlighted how simple models and algorithms trained on vast amounts of data often outperform complex ones trained on smaller datasets. And while scaling laws don’t necessarily hold forever, the concept has yet to fizzle.

“There are fundamental reasons why larger models work better—they’re easier to train, they generalize better, and they give better results,” Ober said. “And so that expands the requirement for training. Then, we keep adding new use cases. Every week there’s some new domain where AI is proven to work really, really well. On top of that, there are state-of-the-art changes about every six to nine months. And every time there’s one of those breakthroughs, everybody scrambles to integrate those learnings into their own neural networks, so the models get bigger and need larger clusters to train.” 

The AI Hardware Race

This article is part of a series called The AI Hardware Race, which focuses on the evolving systems that run AI.

A synchronous move 

When Mark Russinovich, Microsoft Azure CTO, gave an insider’s view into the AI supercomputer infrastructure built to run ChatGPT and other large language models, he said, “An efficient infrastructure for running things at that size is really critical. You have to deal with failures that are happening on a regular basis.”  

It’s not that the big cloud players haven’t dealt with hundreds of thousands of devices before—that’s their bread and butter—but AI workloads are different. They are synchronous: when a single GPU slows down, the hundreds or thousands of others slow down with it. Restarting from failure is fairly common, and it relies on extremely fast, efficient checkpointing to save the state of the model and avoid losing what may be weeks’ worth of computation.
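
At its simplest, checkpointing means periodically persisting everything needed to resume a run, then reloading it after a crash. The sketch below is a minimal, single-process illustration in PyTorch, not how Azure or any particular training stack implements it; the model, file path, and checkpoint interval are placeholders, and real large-scale jobs shard this state across many nodes and very fast storage.

```python
# Minimal checkpointing sketch. All names and intervals are illustrative placeholders.
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"   # placeholder path; in practice, fast shared storage
CKPT_EVERY = 1_000            # illustrative checkpoint interval, in steps

model = nn.Linear(512, 512)                                 # stand-in for a real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

start_step = 0
if os.path.exists(CKPT_PATH):
    # Resume from the last saved state instead of restarting the whole run.
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"] + 1

for step in range(start_step, 100_000):
    loss = model(torch.randn(32, 512)).pow(2).mean()        # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % CKPT_EVERY == 0:
        # Persist everything needed to resume without losing weeks of work.
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT_PATH)
```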

The other challenge of AI’s synchronous nature is that it strains infrastructure in ways that general, heterogeneous cloud workloads—think corporate email meets HR app sprinkled with some data warehousing—never have.  

“When you’ve got 100,000 devices running synchronously through idle, max power, partial power, and going through that again and again, millions of times, that can make megawatt swings in very short order,” Ober said. “That’s difficult to handle. They’re too much for capacitance and they’re too quick for today’s grid infrastructure.” 
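
For a sense of the arithmetic behind that concern, a back-of-the-envelope sketch is below; the per-device swing is an assumed, illustrative figure rather than a measured one.

```python
# Illustrative only: how synchronized behavior turns per-device power swings
# into megawatt-scale swings at the cluster level. The per-device figure is assumed.
DEVICES = 100_000
SWING_PER_DEVICE_W = 500   # assumed idle-to-peak swing per accelerator, in watts

cluster_swing_mw = DEVICES * SWING_PER_DEVICE_W / 1_000_000
print(f"Synchronized swing: ~{cluster_swing_mw:.0f} MW")   # ~50 MW, again and again
```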

A chip renaissance 

While GPUs dominate the AI hardware narrative, they aren’t the only stars of the AI chip show. Everything from TPUs and FPGAs to accelerators and specialized chips is emerging on the processor menu.

According to Amber Huffman, a principal engineer at Google Cloud who is leading the company’s industry engagement efforts in the data center ecosystem, the diversification is largely driven by the slowdown of Moore’s Law. 

“Historically, it just didn’t make economic sense to produce custom chips,” she said. “When Moore’s Law was healthy, you would get a doubling of performance every two years. A custom chip is a journey of more than three to five years, so by the time you bring that custom chip to market, you could instead get 2x (or 4x) for free from Moore’s Law with general purpose chips.”  

But now, technologists are designing custom chips that are focused on specific features for specific workloads. “That’s causing the increasing diversity of chips and we’re going to see that continue, especially as companies add more AI-related features,” Huffman said. 

The trend is well underway, with the established tech giants of the world—Alphabet, Amazon, Apple, Meta, Microsoft, NVIDIA—all designing their own chips. And with startups bringing form-factor-breaking ideas like dinner-plate-sized AI chips or trying to compute like the human brain, the AI chip market is turning into a sparkling candy shop full of options.

For Huffman, one of the most promising developments is the rise of “chiplets.” Squishing multiple computing functions into an ever-smaller area is becoming more expensive, complex, and at times impossible. Chiplets offer a modular approach that can combine different functions, and even different generations of process nodes, into a custom chip.

“By using chiplets, a design can have an IO chiplet on a mature node, combined with CPU and GPU chiplets on an advanced node that can be pieced together in a LEGO block fashion,” Huffman said. 

With much of her work dedicated to open-source initiatives and industry standards organizations, it’s easy to see why Huffman is drawn to the approach. The modular framework allows for more efficient and flexible designs, which could be crucial as AI demands evolve. 

“The same open-source collaboration framework used to scale software can be used to develop hardware,” Huffman said. 

Data: the lifeblood of AI 

It’s easy to get entranced by the growing chip potpourri, but processors can’t do much without getting their hands on massive amounts of data. During training, thousands of GPUs access data storage concurrently. How well that data flows, and how fast it can feed the GPU machine, will determine training time and whether companies can make good on their investment.

“When you think about how capable GPUs are at ingesting data, if we’re not supplying a GPU that data at its throughput potential, then we’re not getting the return-on-investment that could be realized,” explained Niall MacLeod, director of applications engineering for Western Digital’s storage platforms. “So we are duty-bound to drive those GPUs as hard as possible.”  

Today, NVMe SSDs are the fastest way to get data to a GPU (volatile memory is cost-prohibitive except in niche use cases). But like all things AI, architecting storage comes with its challenges, particularly if you don’t have the engineering fleet of the tech giants. 

“There’s a lot of data path management that needs to happen to the traffic flow inside a GPU server’s architecture,” MacLeod said. “We have to make sure that we understand the ratio of drives to GPUs and their respective throughput, and when we disaggregate that storage, we also need to consider the storage system architecture as well as deal with flow control in the network. Ultimately, we need to ensure that the GPUs aren’t starved of data at any point under load.” 
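
A rough sizing sketch of the drive-to-GPU ratio MacLeod describes might look like the following; the per-GPU ingest rate, per-drive read rate, and headroom factor are all assumptions for illustration, and a real design would also account for network flow control, protocol overhead, and redundancy.

```python
# Hypothetical sizing: how many NVMe SSDs it takes to keep a set of GPUs fed with data.
import math

GPUS = 8
GPU_INGEST_GBPS = 10.0   # assumed sustained ingest per GPU (GB/s)
DRIVE_READ_GBPS = 7.0    # assumed sustained sequential read per NVMe SSD (GB/s)
HEADROOM = 1.2           # margin so GPUs are never starved under load

required_gbps = GPUS * GPU_INGEST_GBPS * HEADROOM
drives_needed = math.ceil(required_gbps / DRIVE_READ_GBPS)
print(f"Need ~{required_gbps:.0f} GB/s sustained, so at least {drives_needed} drives")
```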

Recently, MacLeod was involved in demonstrating how a Western Digital server could deliver a sustained throughput of around 90GB/s to two GPUs—more than 5TB every minute. The GPUs were visually rendering a 250 billion grid point tornado simulation, with each grid point carrying over a dozen attributes such as rain, air pressure, and wind speed. In the demo, it was the GPUs’ processing capabilities that became the bottleneck, with the server holding 16% of its throughput in reserve. The industry publication Storage Review, which was at the event, wrote: “While some hyperscalers are off making bespoke solutions to their AI data problem, Western Digital (WD) has an answer for the rest of us.”

The other challenge for storage is that AI is a dynamic workload with various stages—ingest, inference, and tuning—that operate in a continuous loop of consumption and generation. 

MacLeod explained that Western Digital recognized this when the company introduced its six stages of the AI Data Cycle, strategically aligning its technology roadmap. Some of the company’s new AI-ready products include a PCIe Gen 5 NVMe SSD that pushes the speed for AI training and a capacity-optimized NVMe SSD with more than 60 terabytes of storage (yes, s-i-x-t-y). 

Scaling (back) power 

As AI continues to push the boundaries of scale, perhaps the greatest challenge for the inner workings of its machine is the energy required to power it. Many are concerned about generative AI’s impact on local power grids, and we’ve only just gotten started. It’s hard not to wonder how our society will power AI as it scales further to millions of GPUs and beyond.

“That is the question, right?” Ober of NVIDIA said. “And it’s one that everybody I work with is discussing.” 

How concerned should we be?  

Huffman shared a sense of optimism within the industry. “We’re in a bit of a transient state with AI at an inflection point,” she said. “What we do know is, historically, data center energy consumption has grown much more slowly than demand for computing power. A primary approach to achieving this efficiency has been software-hardware co-design, which leads to better efficiency each generation. I think we will continue to see efficiency improvements as we figure out as an industry how to deliver AI at this new scale.” 

Some advances are already evident. In the eight years since NVIDIA introduced its Pascal GPU, the company has increased work per watt by 45,000x (with its latest Blackwell GPUs).
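
To put that figure in perspective, a quick calculation using only the numbers quoted above shows what a 45,000x gain over eight years implies as a compounded yearly rate.

```python
# 45,000x more work per watt over 8 years, expressed as a compounded yearly gain.
yearly_gain = 45_000 ** (1 / 8)
print(f"~{yearly_gain:.1f}x improvement in work per watt per year")   # roughly 3.8x
```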

It’s clear that more innovation will be needed, and it can’t rely on improving processors alone. Ravi Goli, a lead principal software engineer at Microsoft—who is developing a new generative AI product—said, “We need to design new algorithms and new architectures that reduce the computing demands. Maybe there is an entirely new architecture required for AI.”

The concept that “bigger is better” has been central to the rise of generative AI, but unless you are racing to achieve superhuman intelligence, most use cases don’t require that kind of flex.

“With some of the biggest large language models (LLMs) hitting the trillion parameters, you need a huge powerful data center for that—that’s why the industry started exploring Small Language Models (SLMs),” Goli explained. “Small models that, given the right hardware, can run without an internet connection on your local machine. Surprisingly, we saw SLMs performing equivalent to LLMs in some scenarios, given proper data, training methods, and fine-tuning.”   

Many in the industry see SLMs as the energy-efficient, compact solution that will be cheaper, faster, and more secure for most business applications. 

Clem Delangue, CEO of Hugging Face, a machine learning and data science platform and community, recently said, “You don’t need a million-dollar Formula 1 to get to work every day, and you don’t need a banking customer success chatbot to tell you the meaning of life.” 

As AI implementations burgeon, not all of them will rely on AI’s biggest engines or carry their energy footprint. For the workloads that do, Huffman of Google suggests assessing them differently.

“There’s been much focus on the increase in power that AI is consuming, which is vital and should be managed proactively and transparently,” she said. “But it’s important to contextualize that with what we are using that AI for. Hypothetically, say we use AI to solve some of the challenges related to climate change. We need to develop our industry accounting methodology for the energy consumed balanced with the benefits delivered.” 

The liquid cooling chasm 

The final seismic shift in data centers is the concerted transition toward on-chip liquid cooling. NVIDIA’s latest product announcements mark a significant crossing of the chasm from air-cooled solutions to liquid ones.

“We have now migrated to liquid cooling,” Ober said. “We made air cooling work longer than anybody thought possible—so kudos there—but we’re liquid now.” 

Most data centers have tried to avoid on-chip liquid cooling. It isn’t only the aversion to having fluids directly involved with IT equipment and data worth millions of dollars—although that has certainly played a role. Liquid cooling equipment is also costly and, for the most part, incompatible with air cooling. Retrofitting the data center floor requires changing its entire infrastructure, from racks to cabling, in a fundamental way.

While interim solutions, like heat exchangers, will bridge the transition, it’s clear that new data centers will be constructed differently. Just how is not yet clear.

Traditionally, a standards body would define the “one true way” forward. But at the pace AI is moving, those traditional ecosystems struggle to keep up. “We don’t have the luxury of spending two years defining a standard and then waiting another two years for it to come out in a product,” Huffman of Google said.

In May this year, Sundar Pichai, Google’s CEO, revealed the company has already deployed ~1GW of liquid-cooled capacity, which was 70x the capacity of any other fleet at the time.

Huffman sees liquid cooling going mainstream. She hopes to find the 80-90% the industry can agree on and scale it up as quickly as possible through collaboration at the Open Compute Project. “We should not be under the assumption that we’re going to agree on everything,” she said.

It comes down to economics 

Data centers were much more straightforward when servers had a cookie-cutter design, and Moore’s Law reigned. AI upended all of this, diversifying chips, customizing deployments, and pushing every parameter of data center infrastructure. 

As AI surges forward, the most significant infrastructure challenge for companies won’t necessarily be technical—but economic. Liquid cooling, fancy chips, and advanced storage and networking hardware promise to keep pushing AI forward. They’re not without hurdles, but they likely won’t be the deal breakers either. A trillion dollars in infrastructure spending may be, though. 

For the near future, AI will be a vexing problem for CFOs and IT leaders, who need to plan while the state of the art advances so quickly that data centers can be out of date before they’re even constructed. Somewhat counterintuitively, they’ll need to make greater investments to see a better return.

AI is poised to define the next era of challenges and possibilities—for the industry and society. The true measure of its success won’t lie in the technology itself but rather in our (human) ability to navigate its transformative potential. 

“How do we adapt to AI?” Huffman asked. “No one knows what it will look like exactly, but in my view, it’s clear that we can make it amazing, together.”

Credits

Editor and co-author: Ronni Shendar
Research and co-author: Thomas Ebrahimi
Art direction and design: Natalia Pambid
Illustration: Cat Tervo
Design: Rachel Garcera
Content design: Kirby Stuart
Social media manager: Ef Rodriguez
Copywriting: Miriam Lapis
Copywriting: Michelle Luong
Copywriting: Marisa Miller
Editor-in-Chief: Owen Lystrup