September 16, 2024

NVIDIA’s Chief Platform Architect on Scaling AI

Robert Ober is the chief platform architect of NVIDIA’s data center products. He works with global hyperscale data centers to architect AI clusters, including some of the largest deployed, and to influence NVIDIA’s hardware and software platform roadmaps.

Some have compared the rise of AI to the birth of the internet. Others have called it a restart, or one of the few major architectural changes in computing since the introduction of the mainframe. How would you describe it? 

I started my career as a processor architect, about 40 years ago. I’ve been through mainframes and supercomputers, SSDs, wireless, and embedded processors—the bleeding edge of a lot of these things. But those are nothing compared to this transition. 

We’re really seeing the transition from explicit programming to implicit programming, where you don’t necessarily need to be a computing expert to make something profoundly capable. If earlier AIs were about identifying or recommending, what we’ve pivoted to very quietly is synthesis. We’re able to create things—text, mechanical designs, new drugs, even processor architectures. 

We are in the baby steps of a whole new industrial revolution. Any place where there’s a processor, there will be AI. Everything we touch and everything we do is going to be impacted by AI. So, yeah, it’s a big deal.  

You’ve previously discussed the incredible clip at which AI clusters and model complexity have evolved—nearly doubling every month. Is that pace continuing unabated? 

To be honest, I expected the growth to slow down years ago. I have been tracking it since 2017, and what I have found is that the computational capability needed by an AI model increases at least 10x every year. If I look historically at the software tuning and the hardware upgrades that came alongside it, clusters have grown at about 3x year over year.  
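To make those rates concrete, here is a quick back-of-the-envelope calculation using only the two figures cited above (roughly 10x-per-year model demand versus 3x-per-year cluster growth); the time horizons are arbitrary and purely illustrative:

```python
# Rough compounding of the two growth rates cited above (illustrative only).
model_growth_per_year = 10   # compute needed by state-of-the-art models
cluster_growth_per_year = 3  # delivered cluster capability (hardware + tuning)

for years in (1, 3, 5, 7):
    demand = model_growth_per_year ** years
    capability = cluster_growth_per_year ** years
    print(f"{years} yr: model demand x{demand:,} vs cluster x{capability:,} "
          f"(gap x{demand / capability:,.0f})")
```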

There are fundamental reasons why larger models work better—they’re easier to train, they generalize better, and they give better results. And so that expands the requirement for training. Then, we keep adding new use cases. Every week there’s some new domain where AI is proven to work really, really well. On top of that, there are state-of-the-art changes about every six to nine months. And every time there’s one of those breakthroughs, everybody scrambles to integrate those learnings into their own neural networks, so the models get bigger and need larger clusters to train. 

Working with hyperscalers and researchers, you’ve seen how cluster sizes went from the tens to the hundreds and into the thousands of GPUs. Is this the new norm—tens of thousands of GPUs?  

Yeah, this is what I work on every day. We already have 100,000-GPU clusters being built today, and you should expect an awful lot of clusters in the 100,000 to 200,000 GPU range next year. 

Recently, Western Digital announced its AI Data Cycle storage framework alongside several new products and a roadmap to support AI workloads. What’s your view on the role of storage in the AI Data Cycle?  

I think it’s really important to recognize that there are multiple storage solutions needed all at once, and that they accomplish different tasks. AI is not one thing; it’s a multiplicity, and even when you’re doing one task, the profile of that workload changes dramatically.  

During training, data storage is accessed by thousands of GPUs concurrently. It needs to flow into all these devices from the same data set. So, high-performance, high-capacity data center storage is critical. But it’s not just the obvious uses. Large clusters get interrupted many times per day. And unlike the “five nines” of uptime we’re used to, continuing to run when something fails is not going to help you in a synchronous workload. If I have 10,000 GPUs and one of them slows down, the others slow down with it, waiting. So instead, you save the state of that multi-terabyte model in the GPUs by having extremely fast, efficient checkpoint and restore across the data center. That’s strongly correlated with uptime, which directly influences time to train and the overall TCO. 
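As an illustration of the checkpoint-and-restore pattern he describes, here is a minimal, single-process sketch using PyTorch-style APIs; the file path, checkpoint interval, and toy model are placeholder assumptions, and a production cluster would use sharded, parallel checkpointing rather than a single file:

```python
import os
import torch

CKPT_PATH = "ckpt.pt"   # placeholder; in practice, high-throughput data center storage
CHECKPOINT_EVERY = 500  # hypothetical step interval

def save_checkpoint(step, model, optimizer):
    # Persist model and optimizer state so a failed run resumes here
    # instead of restarting training from scratch.
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    # Resume from the last saved state if one exists; otherwise start at step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

model = torch.nn.Linear(8, 8)  # stand-in for a multi-terabyte model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

step = load_checkpoint(model, optimizer)
while step < 10_000:
    # ... forward pass, backward pass, optimizer.step() would go here ...
    if step % CHECKPOINT_EVERY == 0:
        save_checkpoint(step, model, optimizer)
    step += 1
```

The point is the one he makes: the faster the save and restore path, the less GPU time a 10,000-GPU job loses to each interruption.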

Historically in the data center, most data was transient and throwaway. We had very ephemeral data. But now, people are saving that data because, in aggregate, there is meaning buried in there. And so, we’re getting new data all the time to do training. We’ll just learn to save every scrap of data we can.  

What would you say is the most significant hurdle today in building AI infrastructure? What is holding us back and what bottlenecks do you see arising?  

That’s a great question. There’s no simple answer, but I would say scale is at the core of it. First, when we scale these things, we need a very balanced architecture. One of the primary reasons is that we never know where this will evolve. You never know what’s going to be trained on the hardware. AIs with radically different characteristics might run on the same cluster on different days, so we need a very balanced capability across compute (petaflops), memory bandwidth, storage capacity, and low communication latency.  

I think the difficulty with AI is that it’s an unbelievable full-stack optimization, including worrying about the physical data center design. Planning for the future, I’m starting to think about clusters of a million GPUs, and you can imagine that would require gigawatts of power. So now it comes down to power delivery, liquid cooling, network bandwidth and latency, the physical packaging, and the electromechanical design—all of these depend on optimizing the overall architecture in a balanced way.  

The AI Hardware Race

This article is part of a series called The AI Hardware Race, which focuses on the evolving systems that run AI.

Power seems to remain a riddle. With generative AI becoming so pervasive, signaling an insatiable appetite for power-hungry processing, how will we keep up with the power demand?  

Yeah, that is the question, right? And it’s one that everybody I work with is discussing. On one hand, AI models are getting bigger and more complex and are using larger-scale clusters. On the other hand, breakthroughs, especially in generative AI, are enabling increased efficiency in almost every human endeavor and design. In the eight years since our Pascal GPU, we’ve increased work per watt by 45,000x with our latest Blackwell GPUs. AI is enabling us to do new things really efficiently that we’ve never even been able to do before. 
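For a rough sense of what that figure implies, a 45,000x gain over eight years works out to a compound improvement of roughly 3.8x in work per watt each year (a simple calculation on the numbers quoted above):

```python
# Compound annual rate implied by "45,000x work per watt in eight years".
total_gain = 45_000
years = 8
per_year = total_gain ** (1 / years)
print(f"~{per_year:.1f}x per year")  # prints ~3.8x per year
```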

The first step in building new data centers architected for AI is securing power. There are many sources we’ll use along the path to energy-efficient AI, like solar in solar-friendly locations, backed by newer types of batteries to keep nearly the same uptime as the grid. Iceland has data centers leveraging geothermal power.  

What I’m seeing in the hyperscale space is that private power also helps eliminate many of the challenges with attaching to the grid. The problem becomes: how fast can you build a private power source compared to how fast you can build the rest of the AI data center? So, it’s still a problem we’re wrestling with. 

Let’s dig into one of those power problems. The extreme computational horsepower of AI brings with it transients—rapid current changes and spiking voltage. Can you briefly explain this problem and what it means for data center architecture design? 

At a simple level, AI training is a sequence, an iterative optimization. You use samples to train the weights, you update the weights, you share the updates across the entire cluster, and then you do it again. And again. You do it millions of times across a synchronous cluster.  
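A stripped-down sketch of that loop for one worker in a synchronous data-parallel job is shown below, using PyTorch’s torch.distributed primitives; the model, data loader, and process-group setup are placeholder assumptions, and real training stacks wrap this in frameworks such as DistributedDataParallel:

```python
import torch
import torch.distributed as dist

def train(model, optimizer, data_loader, steps):
    # Each worker (one per GPU) runs this same loop in lockstep.
    # Assumes dist.init_process_group() has already been called.
    data_iter = iter(data_loader)
    for _ in range(steps):
        inputs, targets = next(data_iter)  # this worker's shard of the batch
        loss = torch.nn.functional.mse_loss(model(inputs), targets)

        optimizer.zero_grad()
        loss.backward()

        # Share the updates across the entire cluster: a synchronous
        # all-reduce averages gradients, so every worker waits here
        # until the slowest one arrives.
        for param in model.parameters():
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= dist.get_world_size()

        optimizer.step()  # every worker applies the identical averaged update
```

The barrier at the all-reduce is why one slow GPU stalls the rest, and why all of the devices swing between compute-heavy and communication-heavy phases at nearly the same moment.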

AI training is not a heterogeneous data center operation where you can rely on everything running a different workload and averaging out. It’s a very different way of doing things. And when you’ve got 100,000 devices running synchronously through idle, max power, and partial power, again and again, millions of times, that can produce megawatt swings in very short order. That’s difficult to handle. The swings are too much for capacitance and too quick for today’s grid infrastructure.  

This is one of the reasons why private power generation off the grid can make sense. But each generation of GPU is getting better, and we’re continually working on architectural advancements at the silicon, system, cluster, framework, and application levels that can minimize these rapid swings and still allow the cluster to work at its full performance. That’s important for TCO, efficiency, and time to solution.  

With all that heat GPUs produce, is liquid cooling inevitable? What does it take to adopt these solutions and are data centers ready for this change?   

You can see from NVIDIA’s latest product announcements, yes. For training, this boat has sailed. We have now migrated to liquid cooling. We made air cooling work longer than anybody thought possible—so kudos there—but we’re liquid now. 

One of the difficulties has been working with the entire ecosystem to enable the infrastructure for this transition. Most data centers were designed around air cooling. Most of the capacity in the world is air-cooled. So we’ve worked with partners to develop sidecars, or heat exchangers, which are essentially big radiators with pumps. That allows you to have the liquid-cooled GPU compute rack, and then next to it, or in the row with it, exchangers that dump the heat into the air in the data center, so you can keep leveraging existing cooling systems. 

You sit at the heart of the world’s AI infrastructure, and you’re a key part of the breakneck speed at which NVIDIA introduces new chips and devices. What’s your philosophy for designing a roadmap for next-generation hardware?  

It’s actually a pretty deep question, philosophically. But as a company, we have some guiding principles for each next generation: What does physics allow? What are the boundaries we could even push to? What’s the best possible solution that anybody could make? And what are researchers trying to accomplish? Those become our goalposts. That is what we aim for. We want to be as close to perfection as possible. We’re always competing with perfection, not with what somebody else is making.

Then, even more important, we need to prepare and enable the environment and the ecosystem. We need to have our partners—for example, cloud providers—in a position where they can take that into their data centers and they can roll it in at scale very quickly. 

The whirlwind introduction of new neural networks and AI capabilities does not go hand in hand with the long and involved process of procuring IT infrastructure. How can organizations make smart hardware investments that will be valuable and interoperable in the long run? 

Yeah, it’s an outstanding question, and it’s true. The cadence is unbelievable, and it is anathema to CFOs and IT departments. We’ve now gone through many years of that with the big cloud vendors, and we’ve finally settled into an understanding and a cadence. So the first thing I would suggest to an IT department is: know that you will deploy AI equipment. You may not know today exactly what that equipment is, but start planning your facilities for it and make sure they are capable. Power and cooling infrastructure are things you can plan for right now.  

Second, I would suggest over-provisioning at the start—making sure you have a lot of memory, a lot of network, and a lot of storage. Regardless of what you buy, after a few years of AI evolution you will wish you had more computational capability. And what we find over and over again is that the most capable, most expensive platforms actually give the best TCO and will last the longest. So if you can afford it, it’s always best to get the most capable platform you can. It will save you money in the long run.  

Third, know that advancements in neural networks will very quickly consume resources. Enterprises should be updating execution containers, drivers, and frameworks. Software updates will radically improve your throughput and bring profound TCO improvements; throughput often improves 2x per year on the same hardware.  

Lastly, as far as a way for enterprises to get into AI, I would suggest it’s probably better to start with pre-trained foundation models and fine-tune them to your needs. AI is a very specialized discipline, and that approach lets an organization that isn’t yet AI-savvy learn the ropes with the least investment possible.   

Credits

Editor and co-author: Ronni Shendar
Research and co-author: Thomas Ebrahimi
Art direction and design: Natalia Pambid
Illustration: Cat Tervo
Design: Rachel Garcera
Content design: Kirby Stuart
Social media manager: Ef Rodriguez
Copywriting: Miriam Lapis
Copywriting: Michelle Luong
Copywriting: Marisa Miller
Editor-in-Chief: Owen Lystrup