The AI Ecosystem: AMD’s Full Stack Data Center Strategy

Six Five Media hosts Daniel Newman and Patrick Moorhead are at AMD Advancing AI, where they’re joined by AMD’s Forrest Norrod, Executive Vice President and General Manager, Data Center Solutions Business Group, for a conversation on AMD’s strategic evolution toward becoming a data center solutions provider and the latest collaborations driving the AI and chip ecosystems forward.

Their discussion covers:

  • An overview of AMD’s evolved strategy from compute engines to comprehensive data center solutions
  • AMD’s integration of AI, Instinct Accelerators, and EPYC CPUs, and the role of ecosystem collaboration in offering full-stack solutions
  • The significant impact of EPYC in the data center market, AMD’s market share growth, and customer feedback
  • The evolving landscape of AI technologies, the role of GPUs and CPUs, and strategic advice for customers exploring AI

Learn more at AMD and AMD Advancing AI.

Transcript

Patrick Moorhead: The Six Five is On the Road here in San Francisco at AMD’s Advancing AI event. Second annual event. I mean, guess what? It’s all about AI. Data center AI, client AI, and the supporting cast to make it happen. CPU, GPU, networking, NPUs. Great tech.

Daniel Newman: Pat, it really ran the gamut. It showed AMD’s wares across data center and client, and I think there were even a few surprises in there. It was just an overall really positive event for the perspective on AI, and of course for the company.

Patrick Moorhead: Yeah, the event was primarily data center. And I need to pull in Forrest Norrod, who runs that business here at AMD, to talk about it. Forrest, welcome back to The Six Five.

Forrest Norrod: Thanks a lot. Great to be here with you guys.

Patrick Moorhead: Yeah, I mean, last year was a home run, and we were all wondering what you could do this year. And it was pretty exciting. We did a couple of broadcast interviews, and congratulations.

Forrest Norrod: Well, thank you so much. Super proud of what the team has done.

Patrick Moorhead: Yeah.

Daniel Newman: Yeah, it was a really compelling mix, and I think you heard it in a bit of the build-up there. What is AMD in terms of data center and client? There’s this movement with the AI PC, and even networking. And the data center is under your purview. So you had some pretty comprehensive announcements around EPYC, some pretty big announcements around Instinct, and of course a number of sizable announcements in networking, which I think was a little new, maybe even a little surprising to people. Talk about how you’re evolving from a company focused on shipping many, many parts to a company that’s really building systems and an entire AI stack for companies.

Forrest Norrod: Yeah, well, we’ve been on a journey in the data center for the last decade: first reestablishing ourselves as a credible component supplier, initially of server CPUs, then extending into GPUs, and then extending across the data center into the rest of the gear. And so networking was the obvious next step in terms of the silicon that you need. You need a CPU, you need a GPU, you need a way to put it all together. But when you look beyond that, the complexity of the systems that we’re building nowadays is getting so high.
Just the power, the density, the challenges of building a 200-kilowatt rack with 70-plus GPUs in it, and CPUs, et cetera. You have to start thinking about it as, you’re not designing chips anymore. You have to think from the beginning that you’re designing systems. Because if you don’t, you’re going to screw it up. And so we’ve been on this journey of moving from component after component, to broad coverage of world-class components, and then bringing in world-class capability to build up system solutions as well.

Patrick Moorhead: Yeah, it is interesting how it has evolved. Every huge inflection point in technology, whether it was mainframes to minis, or minis to client-server, and then you add on different workloads. I remember when even the database was putting stresses on memory, and by the way, it still is, and storage. But having CPU, and today GPU, storage, memory, and networking in alignment is important. And you talked about the simplicity of that. Can you talk a little bit about reasons other than simplicity that people want to buy full solutions today? Is it time to market, or is it really simplicity? Is it performance? Reliability? Why is that?

Forrest Norrod: I think it’s a little bit of all of those. One is that most people don’t want to put the solution together themselves. So particularly for enterprises, or say tier-two cloud players, they may not have the interest or the capability to assemble best-of-breed components into an integrated solution and make it work. And so they look to companies like AMD, working with our partners, Dell, HPE, Lenovo, et cetera, to put together full-up solutions. And that’s what they want to buy. It’s funny, I sometimes tell my team, nobody wants to buy a server CPU. They don’t want to buy a server; they want a solution. They want a solution to a problem. And so I think that’s the dominant reason. But then beyond that, if you put these things together, you can gain more performance, you can optimize the interfaces, you can look for ways to optimize data flows, you can manage, monitor, and respond to failures better. And so it’s not just about the solution, it’s about making the solution work well in the data center.

Patrick Moorhead: It also sounds like it’s really about have-it-your-way, right? Because partnering with other people in the ecosystem, they might want to piece-part it, right? And you’re open, so you’re giving them open pathways to be able to do that.

Forrest Norrod: True.

Patrick Moorhead: But you’re also… You know? It’s almost like some people want the easy button, or the easier button. And at least what our research suggests is that most training runs bomb out because of the network, and second because there’s maybe an issue with a GPU. And it sounds like what you’re doing on the networking side, first of all, the DPU is driving it all on the front end with Pensando, and now even the back end, with this new AI-based NIC that, it sounds like, relieves congestion on the back end of the networks where they have GPUs or accelerators.

Forrest Norrod: No, that’s right. I mean, networking, I think, is becoming an appreciated part of the overall problem of these big GPU clusters. And you’re exactly right. When you think about these things at scale, 10,000, 20,000, 30,000 GPUs, you’re going to have a failure every few hours. The laws of physics dictate it, no matter how resilient you are. And so number one, you want to have high-reliability solutions so that you minimize those failures. But then two, you have to have the ability to recover. Recover, or God forbid, roll back to a checkpoint if you absolutely have to. But the last thing you want to do is restart the job. And so for us, networking is a critical part of having that overall solution with the resilience, the monitoring, even the predictive ability to say, “Hey, this part is likely to fail,” as we start to see errors accumulating. And so yeah, the Pensando technology, being able to offer high-speed networking in a fully programmable way, so that you can add high-value services on top of it, is critical.
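To make the scale argument concrete, here is a back-of-the-envelope check; the per-GPU reliability figure is an illustrative assumption, not an AMD number. If each GPU fails independently with a mean time between failures (MTBF) of T hours, a cluster of N GPUs sees a failure roughly every T/N hours:

MTBF_cluster ≈ MTBF_GPU / N

With N = 30,000 GPUs and an assumed per-GPU MTBF of 100,000 hours, that works out to a failure about every 3.3 hours, consistent with the “every few hours” figure above.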

Daniel Newman: It also seems like a really substantial TAM opportunity as you look to expand. Because the new Instinct GPUs have been your fastest-ramping product, and we’ve seen the value of systems and what they’ve created for this market and this industry. The opportunity for AMD to provide more of the total solution is certainly going to increase the company’s opportunity to drive more revenue, and of course better outcomes for customers. I also think it’s important to mention you made some really great progress on software. The software-hardware connection right now is symbiotic. It’s really important in this AI era that the software is enabled and the developers buy in, so that this technology, which has been on par for some time, can fully realize its potential.

I’d be remiss not to talk about EPYC, though. I know everyone wants to spend all the time talking about the AI chip, but CPU and GPU are another symbiotic pairing. And you’ve done a remarkably good job of winning the cloud providers. I think 50, 60%, and we’ve heard numbers as high as 80%, with certain cloud providers on EPYC. You’ve made some really great announcements there. Talk just a little bit about how that roadmap evolves. What do you attribute so much success to? And do you think you can keep going? Is there more market to gain?

Forrest Norrod: Yeah, well, first off, I’m incredibly proud of what the team has done. EPYC is an incredible product, the whole series of products. It’s a great roadmap, and the team’s done an incredible job. We’re honored to get such a high share with the cloud. And the thing that I’ll attribute that to is that, at the end of the day, for the cloud players, the data center is not a cost center. It’s their factory. Their products are produced in the data center. And so making that data center more efficient, making it higher performance, translates immediately into cost of goods sold, cost per service user, cost per query, cost per YouTube video. And because of that, the folks making the server CPU selection for those data centers have a direct connection to the CEO. They’re critical in driving business results.

And so they’re going to embrace the superior solution, sort of devoid of any other considerations. I’ll contrast that with the classic enterprise, where the data center is an integral part of their business, but essentially it plays a supporting role. It’s not their factory; it is a cost center. And I think there, the CIOs are much more concerned about risk, about business interruption. They don’t want to disrupt the business. And so their calculus in deciding whether or not to embrace something new is different. I think that over time, as we’ve continued to deliver, the acceptance of EPYC is growing and the fear of the new is diminishing. And so we’re starting to see more and more customers embrace EPYC in the enterprise on-prem as well. And I’ll tell you, the other interesting thing we’ve noticed, and maybe we should have seen this before, is that there’s an interesting on-ramp here.

The easiest way to get enterprises onto EPYC and allay that fear is to get them to try it in the cloud. It’s easy to do; it’s an easy switch. And once they start seeing it in the cloud, and everybody’s using a hybrid environment now, so they all have some cloud deployment, it makes it much easier to go back and talk to them about, “Well, let’s talk about on-prem as well.”

Patrick Moorhead: Yeah, it’s interesting behavior in the enterprise. You have some enterprises that say, “Oh my gosh, some hot new technology is great in the cloud. I need to adopt that, or I will adopt that in the future.” Others might think, “Oh, that’s not for me,” but ultimately they do end up at what the hyperscalers want to use. It might not look the same. I mean, even the networking offload that you have with Pensando today had been in the hyperscalers for years before it came to the enterprise. So I fully believe that you will have an even greater story as people get more comfortable with EPYC there. I do want to talk about the future of the CPU. It’s funny, there are a lot of memes out there, a lot of discussion that, as it relates to AI, the CPU doesn’t matter, right? Oh, it’s a head end.

I mean, you’ve got a bunch of headless GPU servers, and it’s a nice traffic cop out there. And then on the other side, it’s kind of like, “Well, history says that when things get mainstream, you want to integrate a lot of that onto the CPU. Great latency, lower power, either by putting blocks on there or algorithms that support it.” What’s your view of the future of the CPU in AI and its importance?

Forrest Norrod: Yeah, I think that the CPU can play, or does play, a pretty important role in determining the performance of an AI system. So let’s keep this simple and just talk about a CPU feeding a GPU AI cluster. One of the things we talked about today and showed on stage is that the CPU can make a big impact on the performance of the GPU cluster. So this thought that the CPU doesn’t matter is actually easy to disprove. We’ve shown that moving from, say, a Sapphire Rapids head node to an EPYC head node can give you 10 to 15% more performance on inference. By the way, that’s for MI300, for AMD Instinct accelerators, or for Nvidia. And for some training applications, we’ve shown a 20% performance uplift.

Everything else held constant. All we’ve done is put a higher-performance CPU in there. Now why is that? It’s because of Amdahl’s law, right? By speeding up the part of the algorithm that the CPU is executing, you speed up the whole process, roughly in proportion to the amount of work being done there. And so it can make a very big difference. And I think that as we really recognize this and play it forward, we’ll design CPUs that are even better at feeding, accelerating, and orchestrating GPUs.
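Amdahl’s law makes that arithmetic concrete. If a fraction f of total wall-clock time is spent on CPU-side work (data loading, preprocessing, orchestration) and a faster CPU shortens that portion by a factor s, the overall speedup is:

Speedup = 1 / ((1 − f) + f / s)

As an illustrative case, with assumed values rather than AMD measurements: if f = 0.2 and s = 2, the overall speedup is 1 / (0.8 + 0.1) ≈ 1.11, roughly an 11% uplift, in line with the 10 to 15% inference figure cited above.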

Patrick Moorhead: Interesting. Not a lot of people know that.

Daniel Newman: Yeah, I think you sort of see it when you look at these big systems that combine a CPU and GPU, and why the most advanced ones do. And that’s why I use the word symbiotic. We’ve actually seen it in the market; there are a number of companies, and it became a little taboo to talk about the CPU for AI. But we still know a lot of inferencing is done on CPUs, including on EPYC.

Forrest Norrod: That’s true as well.

Daniel Newman: Forrest, I just want to thank you so much for joining us. I know it’s been a really busy day. Congratulations on all the announcements, and we look forward to hopefully having you back soon, whether it’s next year at Advancing AI in 2025 or hopefully sometime sooner.

Forrest Norrod: Very good. Well, thanks a lot guys. Appreciate the opportunity to chat.

Patrick Moorhead: Yeah, thanks Forrest.

Daniel Newman: And thank you for tuning into this episode of The Six Five. We are On the Road here at the Advancing AI event for AMD in San Francisco. Hit subscribe, join us for all of our content and coverage from the event. There was a lot of it. And be part of our community. We appreciate you, but we got to go for now. See you later.
