September 26, 2024

Building the Blueprint for an AI-Optimized Data Center

Amber Huffman is a principal engineer at Google Cloud, leading industry engagement efforts in the data center ecosystem. She is on the board of multiple open source and industry organizations and has led numerous industry standards to successful adoption.

You’ve joined Google Cloud during one of the most significant technological changes in the industry. What’s it like being at the heart of this new era? 

What’s most exciting for me at Google Cloud is being at the forefront of this new era: seeing the leading-edge models, the chips, and the data centers, and having an end-to-end insider view of how our world is changing. All of us grappling with this new world appreciate how profoundly it will change the way we learn, work, play, and interact.

When I was in college, the web was emerging. People were concerned about the internet and its impact on society. It was scary, but it turned out to be an immensely useful tool that has moved us all forward. We all think change is scary. So, how do we adapt to AI? No one knows what it will look like exactly, but in my view, it’s clear that we can make it amazing, together. 

In one of your recent talks, you said that open-source hardware will revolutionize technology over the next two decades. What do you mean by that?  

Open-source software has been powerful, enabling developers from around the world to build on each other’s ideas and accelerate innovation. More than 90% of today’s software projects have open source in them. It’s been foundational to the cloud and now for AI—you see open-source software infused throughout generative AI (genAI).  

As Moore’s Law has slowed, companies are making their own chips to deliver better solutions. And when you make a chip, you define it in a software language. The same open-source collaboration framework used to scale software can be used to develop hardware. We are seeing the emergence of the first open-source hardware collaborations and it could be as transformative to our industry over the next two decades as open-source software has been over the last two decades.  

There’s a lot of concern about security in the age of AI. People see risks of rogue agents, deepfakes, or malicious manipulation of training data. How prepared are we to handle these challenges? Can you share some insights from your recent work? 

Security is definitely a journey. As a community, we must constantly evolve to meet the emerging threats. My work on security has focused on ensuring devices, whether they’re SSDs, hard drives, CPUs, or GPUs, haven’t been compromised by a bad actor. 

A project I’m especially excited about is Caliptra, a partnership between the Open Compute Project and the CHIPS Alliance. It’s a hardware root-of-trust block that determines whether a piece of software has been tampered with by checking its digital fingerprint. We are currently enhancing Caliptra with an NVM Express key management feature so that if an SSD were stolen from a data center, or infiltrated by a sophisticated attack, the user data could not be extracted.
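
To make the “digital fingerprint” idea a bit more concrete, here is a minimal, purely illustrative sketch in Python. Caliptra itself is a silicon block with its own ROM and firmware, so this is only a software analogy for the check it performs; the file names and helper functions are hypothetical.

```python
import hashlib
import hmac

# Toy analogue of a root-of-trust measurement: hash a firmware image to get its
# "fingerprint" and refuse to hand over control unless it matches a trusted
# reference. In real hardware, the reference value and the comparison are
# anchored in silicon and signed manifests, not in mutable host software.

def measure(firmware_image: bytes) -> bytes:
    """Compute a cryptographic digest (the fingerprint) of a firmware image."""
    return hashlib.sha384(firmware_image).digest()

def verify_before_boot(firmware_image: bytes, trusted_digest: bytes) -> bool:
    """Allow boot only if the measured fingerprint matches the trusted one."""
    return hmac.compare_digest(measure(firmware_image), trusted_digest)

if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    trusted_digest = measure(open("firmware_golden.bin", "rb").read())
    candidate = open("firmware_candidate.bin", "rb").read()
    if not verify_before_boot(candidate, trusted_digest):
        raise RuntimeError("Fingerprint mismatch: firmware may have been tampered with")
```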

Caliptra includes a complete open-source hardware implementation that was developed in deep partnership across historically intense competitors. Google and Microsoft compete in the cloud; AMD and NVIDIA compete on chips. The magic of open source is bringing disparate teams with different worldviews together to solve a common problem. And when everyone is collaborating to harden a security block, that collective effort of the best minds delivers a more robust solution.  

As you talk about ensuring trust and transparency between systems on chip, you’re not only referring to GPUs, CPUs, and accelerators but also to storage and SSDs. How would you define the role of storage in AI architectures?  

AI requires a lot of data, and storage plays a huge role. One of the things we need to keep doing with storage and memory is refining the hierarchy for hot, warm, and cold data so that it works seamlessly.

If you’re training a large model, it can take a few months. While you’re doing that training, checkpoints are taken frequently so that if there’s a failure, you don’t lose a lot of great work and can resume from the last checkpoint. Today, these checkpoints are stored in memory. Moving forward, it would be super helpful if checkpoints could be supported by storage in a more seamless way.
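
As a rough sketch of the checkpoint pattern she describes, the snippet below periodically writes training state to storage so that a failure only costs the work done since the last snapshot. The step counts, paths, and the train_one_step helper are hypothetical stand-ins, not any particular framework’s API.

```python
import os
import pickle

CHECKPOINT_EVERY = 1000          # steps between snapshots (illustrative)
CKPT_DIR = "/mnt/checkpoints"    # local SSD, parallel file system, or object store

def train_one_step(state: dict) -> dict:
    # Stand-in for a real forward/backward pass and optimizer update.
    state["steps_done"] = state.get("steps_done", 0) + 1
    return state

def save_checkpoint(step: int, state: dict) -> None:
    # Write-then-rename so a crash mid-write never leaves a half-written file.
    path = os.path.join(CKPT_DIR, f"step_{step:08d}.pkl")
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def train(state: dict, total_steps: int) -> dict:
    for step in range(1, total_steps + 1):
        state = train_one_step(state)
        if step % CHECKPOINT_EVERY == 0:
            save_checkpoint(step, state)   # resume point if the job later fails
    return state
```

Real training frameworks typically checkpoint asynchronously and shard the state across hosts so the accelerators keep running while the snapshot drains to storage, which is exactly where tighter storage integration would help.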

Overall, as these larger models continue to grow, the data used to train them becomes even more important. And that’s where storage and memory continue to come in. They are central to AI and everything we do.  

The AI Hardware Race

This article is part of a series called The AI Hardware Race, which focuses on the evolving systems that run AI.

You have been at the helm of defining standards and form factors that have propelled the industry over the last two decades. What kind of standards does the industry need in the age of AI?  

Standards are already playing a huge role in AI and fostering an open and collaborative ecosystem. We’re well underway on that journey with AI solutions using industry standards like NVM Express for SSDs, Universal Chiplet Interconnect Express for chiplets, JEDEC for memory, and more.  

A challenge the industry faces is the incredible pace at which AI moves; the traditional ecosystem struggles to support it in a meaningful way. We don’t have the luxury of spending two years defining a standard and then waiting another two years for it to show up in a product. The Open Compute Project has been an essential forum for companies to come together, identify commonalities, and move faster than a traditional standards process. As one example, there’s a rack specification called Open Rack V3. Google and Meta both implement Open Rack V3 in our data centers; however, only 90% of it is common. A few pieces are different, like the bus bar location. And that’s okay! The time it would have taken to agree on that last 10% outweighs the benefit.

AI is really pushing the state of the art in power and cooling infrastructure, with a major focus on liquid cooling. How do we ensure that colocation facilities have liquid cooling infrastructure any customer could use? A standards body would say there has to be one true way, which is (as the saying goes) like throwing the baby out with the bathwater. I see this as an opportunity to ask: what is the 80 or 90% we can agree on so we can scale this capability across the industry, instead of spending hours and hours arguing over the best way to do it? Each of us deals with different constraints, and we should not assume we’re going to agree on everything.

With Moore’s Law slowing and AI’s data-hungry needs, we’re seeing a rush of hardware diversification. What kind of new hardware can we expect in the next decade?  

Historically, it just didn’t make economic sense to produce custom chips. When Moore’s Law was healthy, you got a doubling of performance every two years. A custom chip is a journey of three to five years or more, so by the time you brought that custom chip to market, you could instead have gotten 2x (or 4x) for free from Moore’s Law with general-purpose chips.

With Moore’s Law slowing, the performance uplift of general-purpose chips is typically 10% or 20% per generation. Now, to build compelling products, technologists are designing custom chips that include only the specific features that provide value for that product. That’s driving the increasing diversity of chips, and we’re going to see that continue, especially as companies add more AI-related features.
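
The back-of-the-envelope arithmetic behind that argument, using illustrative numbers rather than anything stated in the interview, looks roughly like this:

```python
# Compare the "free" uplift from a healthy Moore's Law cadence against a slowed
# 10-20% per generation, over a typical custom-chip development timeline.
years = 4                          # a custom-chip journey in the 3-to-5-year range
generations = years / 2            # roughly one process generation every two years

healthy = 2.0 ** generations       # doubling per generation -> ~4x over 4 years
slowed = 1.15 ** generations       # ~15% per generation -> ~1.3x over 4 years

print(f"Healthy Moore's Law uplift over {years} years: ~{healthy:.1f}x")
print(f"Slowed general-purpose uplift over {years} years: ~{slowed:.2f}x")
```

With only about 1.3x coming for free, a custom chip that is meaningfully better at its specific task has a clear payoff, which is the economic shift described above.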

One thing I find fascinating is how this AI moment came about: through accidental innovation, where hardware was used for a purpose different from what it was originally intended for. Specifically, an NVIDIA GPU designed for graphics was used to win the ImageNet competition in 2012. So how do we stumble upon the next scientific breakthrough that unlocks entirely new areas of innovation? That’s hard to predict, and it’s the reason I think open-source hardware is interesting. If you let a thousand flowers bloom, a few of them will actually pay off, and we can move innovation forward more quickly by making more bets and collectively seeing which ones work out.

Amid that future AI chip potpourri is also the concept of chiplets. You’re on the Universal Chiplet Interconnect Express board of directors. How do you see chiplets shaping the future of AI?

I think chiplets are extremely important as we move forward, and they emerged for numerous reasons. One reason is simply that some large chips can’t fit on a single reticle, so you’re forced to split the design into multiple chiplets.

Another reason is that as we move to more advanced nodes, the cost increases dramatically. The price of a single EUV tool is pretty crazy. So you only want to place the components that benefit most, such as CPU or GPU logic circuits, on the most advanced nodes, while leaving memory and IO on an older, mature process that has been fully tested. With chiplets, a design can combine an IO chiplet on a mature node with CPU and GPU chiplets on an advanced node, pieced together in LEGO-block fashion.

Chiplets will increasingly be used throughout the industry, especially as ease of collaboration is enhanced through standards, like the new manageability features in the UCIe 2.0 specification released in August. 

Energy efficiency was a top concern even before AI made its current splash. But generative AI’s energy consumption is gargantuan in comparison. How would you see the industry go about solving this? 

This is certainly an important challenge, and one we are very engaged with. I think there are a few ways to look at this. One is that we’re in a bit of a transient state with AI at an inflection point. What we do know is, historically, data center energy consumption has grown much more slowly than demand for computing power. A primary approach to achieving this efficiency has been software-hardware co-design, which leads to better efficiency each generation. For example, our sixth-generation TPU is the most performant TPU to date—it’s 67% more energy-efficient than TPU v5e. I think we will continue to see efficiency improvements as we figure out as an industry how to deliver AI at this new scale. 
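
For a sense of what a figure like that means in practice, one common reading of “more energy-efficient” is more useful work per watt. Under that assumption (ours, not a definition stated in the interview), the energy needed for a fixed amount of work drops roughly as follows:

```python
# Illustrative arithmetic only: if a chip is X% more energy-efficient in the
# work-per-watt sense, the energy for a fixed amount of work scales as 1 / (1 + X/100).
improvement = 0.67                        # "67% more energy-efficient"
relative_energy = 1 / (1 + improvement)   # energy per unit of work vs. the prior generation

print(f"Energy per unit of work: ~{relative_energy:.0%} of the previous generation")
print(f"That is roughly a {1 - relative_energy:.0%} reduction for the same work")
```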

A second piece is that there has been a lot of focus on the increase in power that AI is consuming, which is vital and should be managed proactively and transparently. But it’s important to contextualize that with what we are using the AI for. Hypothetically, say we use AI to solve some of the challenges related to climate change. As an industry, we need to develop an accounting methodology that balances the energy consumed against the benefits delivered.

At Google, for example, we already use AI to help airlines mitigate the warming effects of contrails, to reduce vehicle emissions in cities, and to do more accurate weather forecasting to understand whether wind or solar power in a data center location is sufficient. Do we need to move our data center load to a different data center because we’re not going to get the sustainable power we’re expecting here? In that way, we’re already utilizing AI to do a better job at combating climate change. 

Before we sign off, you’re recognized as an inclusive leader, a passionate mentor, and an advocate of the power of community. What role will these qualities play in an era defined by AI?  

My opinion is that mentorship, inclusion, and community are still going to be critical. Humans are making AI, and humans need to collaborate to figure out how to move forward. Sociology has shown that decisions are made based on emotion rather than logic alone, which I have seen throughout my career. We still have to help each other have more compassion and coach people to see other people’s perspectives. How do young people, who I feel are at a disadvantage because these online tools meant they grew up with less direct communication, collaborate effectively in the real world? And how can we sponsor and support them in their development? Building deeper human connection is vital, because that is what gives meaning to our lives.

Credits

Editor and co-author: Ronni Shendar
Research and co-author: Thomas Ebrahimi
Art direction and design: Natalia Pambid
Illustration: Cat Tervo
Design: Rachel Garcera
Content design: Kirby Stuart
Social media manager: Ef Rodriguez
Copywriting: Miriam Lapis
Copywriting: Michelle Luong
Copywriting: Marisa Miller
Editor-in-Chief: Owen Lystrup