Industry-First Insights on the Dell PowerEdge XE9680 with Intel Gaudi 3

Dell Tech, Intel, and Metrum AI – an AI dream team? Host David Nicholson is with Dell Technologies’ Manya Rastogi and Metrum AI’s Steen Graham for Six Five On The Road at SC24 to discuss the groundbreaking potential of the Dell PowerEdge XE9680 server, featuring Intel Gaudi 3 technology.

Their discussion covers:

  • The collaboration between Dell Tech, Intel, and Metrum AI in developing the PowerEdge XE9680
  • Unique features and benefits of the PowerEdge XE9680 for AI and high-performance computing applications
  • The role of Intel Gaudi 3 accelerators in enhancing ML workloads
  • Insights into Metrum AI’s integration and application experiences with the new server
  • Thoughts on future technologies and trends in the AI and computing space

Learn more at Dell Technologies and Metrum AI.

Watch the video below from Six Five Media at SC24, and be sure to subscribe to our YouTube channel so you never miss an episode.

Transcript

David Nicholson: Welcome to SC24, the Supercomputing Conference in Atlanta, Georgia. I am here at the Dell Technologies presence here at SC24. And the reason why that’s important is because in the era of AI, it’s important to have choices. And Dell has always been, if you will, the Switzerland of IT choice. Sometimes certain technologies get dominated by one player or another. Dell has always held strong on the idea that its customers ultimately drive choice. I have two fantastic guests to talk about this very subject today, Manya from Dell Technologies. Welcome, Manya.

Manya Rastogi: Thank you.

David Nicholson: And Steen from Metrum AI.

Steen Graham: Great to be here.

David Nicholson: Welcome both. I understand that Manya, you’ve been working on something pretty special with a certain GPU from a certain company. Tell us about it.

Manya Rastogi: Of course. I mean, we are excited to talk about the XE9680, the Dell PowerEdge server, which is the ultimate AI server in the industry right now. With the plan for silicon diversity from Dell, we are now offering this XE9680 with Gaudi 3 from Intel. It’s already available as a restricted RTS to some customers for testing, but come December it’s going to be available to everyone and ready to ship. We’re just super excited to get this offering in the market.

David Nicholson: So Manya can say that the XE9680 with Gaudi is cool because she is part of Dell Technologies. You’re with Metrum AI, so you’re a bit of a third-party, objective observer. What say you, Steen, on the subject of Gaudi and what you’re seeing so far?

Steen Graham: Yeah, innovation today is GPU constrained, or, as some companies would prefer to say, AI accelerator constrained. And I think the two companies that would prefer to say AI accelerators are also supporting the XE9680. So what we get in the XE9680 is a no-compromise solution, where we end up with the industry leader and then we’ve got choice among the other two AI accelerators in the market. It’s absolutely fantastic.

David Nicholson: So Manya, I alluded to this earlier, Dell is about customer choice. What are the challenges that you are seeing customers facing today when it comes to thinking about AI?

Manya Rastogi: That’s a great question, and honestly, particularly with this offering with Gaudi 3, there are a few challenges that we are trying to solve. The first one is of course the choice of GPU or AI accelerator. The second is that customers don’t have to be tied into proprietary software or networking, which is one of the main differentiators with Gaudi 3 from Intel: the networking is all based on RoCE and on Open Compute standards, so it’s not proprietary. And the third thing I would say is the scale-out that it offers, the scalability with the XE9680 and Gaudi 3 offering, while still maintaining the cost of the whole infrastructure. So those are some of the great things.

David Nicholson: So you can obviously deliver, as you’re referencing, an XE9680. That has what, eight Gaudi 3 accelerators?

Manya Rastogi: There are eight cards. It’s called the OAM form factor, the Open Compute Accelerator Module.

David Nicholson: Okay. In 6U.

Manya Rastogi: In 6U, yeah.

David Nicholson: So in that cabinet though, can Dell also, or is Dell planning to deliver sort of rack-scale infrastructure solutions based on Gaudi?

Manya Rastogi: Yeah, so it will scale out. You can connect starting from two nodes, four nodes, up to 16 or 32. Ultimately, that’s the scalability that we want to offer. And it’s connected with the Dell PowerSwitch Z9864F and the 64 OSFP ports that it offers.

David Nicholson: Okay.

Manya Rastogi: So yes, all that scalability with the rack integration is part of the offering.

David Nicholson: So Steen, you and Metrum are known, kind of colloquially, as the most feared man among GPU manufacturers because of your relentless pursuit of the truth when it comes to performance. But what are you seeing in terms of relative performance from these devices? Do they make sense?

Steen Graham: Yeah, I mean, absolutely. And I think we’ve had a unique opportunity, thanks to our tight collaboration with Dell, to run a tremendous amount of code. And we’ve seen the evolution of new entrants into the GPU, or AI accelerator, market, and I think Gaudi 3 is up to the task of participating in this market. Intel’s made some bold claims about performance, and we’re thrilled to see their software stack evolve. We’ve made a few tweaks to it ourselves, and we’re seeing it reach those claims. And I think, as Manya alluded to, that claim includes a TCO story as well. So it’s not just performance, it’s a performance-per-cost story as well. So I think Gaudi 3’s on a great trajectory as it enters the market. I mean, right below us, live, we’re remoted into Round Rock running one of the few Gaudi 3 systems in the world. We built a full agentic RAG stack for internet service provider customer support agents, and it’s working fantastically.

So that whole software ecosystem on top of Gaudi is ready to go, especially for the AI workloads. As you walk around SC24, we like to talk about traditional HPC as well, but on the AI stack, I think we’re ready to go run things like agentic RAG. It’s going to have a good TCO story, and it’s going to be complementary to the other two players in the market as well.
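As a rough sketch of the kind of agentic RAG loop Steen describes for a customer-support agent, the shape might look something like the following. Every function here is a hypothetical stand-in, not Metrum AI’s actual stack or the Gaudi software API:

```python
# Minimal sketch of an agentic RAG loop for a support agent.
# retrieve() and generate() are hypothetical stand-ins: in a real stack,
# retrieve() would query a vector store and generate() would call an
# LLM served on the accelerator.

def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Stand-in for a vector-store lookup over support documents."""
    corpus = {
        "billing": "Refunds are processed within 5 business days.",
        "outage": "Check the status page before opening a ticket.",
        "router": "Power-cycle the router, then re-run diagnostics.",
    }
    return [text for key, text in corpus.items() if key in query.lower()][:top_k]

def generate(prompt: str) -> str:
    """Stand-in for one inference call to the served model."""
    return f"[model answer grounded in: {prompt[:60]}...]"

def support_agent(question: str, max_steps: int = 3) -> str:
    """Agentic loop: retrieve context, generate, stop once grounded."""
    context: list[str] = []
    answer = ""
    for _ in range(max_steps):
        context.extend(retrieve(question))
        prompt = f"Context:\n{chr(10).join(context)}\n\nQuestion: {question}"
        answer = generate(prompt)
        if context:  # naive stopping rule: answer once grounding docs exist
            break
    return answer

print(support_agent("My router keeps dropping the connection"))
```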

David Nicholson: Okay. So when you look into the future, you see choice and healthy competition among these folks. So Manya, let’s double-click on that performance question, the value question. What are some of the ways that you measure how Gaudi 3 is stacking up against the others that you offer?

Manya Rastogi: So we try to focus on the workloads that customers really want in the market for AI: inferencing, training, fine-tuning, distributed fine-tuning. And that’s all in partnership with Metrum; with Steen, the team has worked on bringing up that software stack and deploying those applications. Tokens per second is one of those numbers, one aspect just for AI. But at the same time, we can look at CPU utilization, GPU memory utilization, and things like that. And when we compare those with the other vendors, we are trying to build up to the same level so that we can have an apples-to-apples comparison. At the same time, it’s public knowledge that Gaudi 3 is introduced in the market at a lower price point than the competitors. So ultimately it does impact the cost story, the overall TCO.
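A throughput number like tokens per second can be collected with a loop as simple as the sketch below; the generate() function is a hypothetical stand-in for whatever serving endpoint is actually under test, not a Dell or Metrum AI harness:

```python
# Rough sketch of a tokens-per-second measurement.
import time

def generate(prompt: str) -> list[int]:
    """Stand-in: returns the token IDs produced for a prompt."""
    time.sleep(0.05)          # pretend decode latency
    return list(range(128))   # pretend 128 output tokens

def tokens_per_second(prompts: list[str]) -> float:
    """Total output tokens divided by total wall-clock time."""
    start = time.perf_counter()
    total_tokens = sum(len(generate(p)) for p in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

print(f"{tokens_per_second(['q1', 'q2', 'q3']):.1f} tokens/sec")
```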

David Nicholson: Now, it’s always valid to test at the component level, and you can go down to a level of disaggregation where you’re testing a DIMM, just the memory module itself. And from there you build up to higher levels of the stack. Steen, from your perspective, how relevant is it to test just the server versus the entire rack-scale architecture? Should we be looking at what these entire rack solutions end up delivering over time? Will that be another data point that’s important?

Steen Graham: It’s an incredibly important data point as you scale out, and we can’t not talk about it at Supercomputing. I think once you scale things out, the worst-case scenario is you become constrained on your network or constrained on your storage, because the most expensive components are these AI accelerators, or these GPUs. In my long time at Intel, they taught me it’s the half-billion-dollar lithography equipment that we need to make sure is the bottleneck, not the two-million-dollar tester on the back end. Right? That’s Economics 101; it’s also manufacturing throughput 101. And I think as you scale these clusters up, as you do more complex distributed inferencing or training tasks or fine-tuning tasks, you absolutely need to think about that architecture as well.

We actually just rolled out a blog on enabling Gaudi 3 with RDMA over Converged Ethernet. That’s basic blocking and tackling; you get dramatic performance improvements when you go do the work on the Ethernet as well. So networking is your first introduction to a bottleneck, and you have to move around a lot of data when you run massive clusters. In certain scenarios, when you’re sharing model weights, storage, in addition to the network, is going to be a bottleneck as well. So you really have to think through that. It would be a shame if you bought all these fantastic AI accelerators and your bottleneck was somewhere else in that cluster.
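A quick back-of-envelope calculation shows why the fabric matters so much when model weights move around a cluster. The numbers below are illustrative assumptions, not measured figures from Dell, Intel, or Metrum AI:

```python
# Back-of-envelope: how long does it take to ship a model's weights
# over links of various speeds? If this dwarfs the compute time,
# the network, not the accelerator, is the bottleneck.

def transfer_seconds(weight_bytes: float, link_gbps: float,
                     efficiency: float = 0.9) -> float:
    """Seconds to move weight_bytes over a link_gbps link at given efficiency."""
    return (weight_bytes * 8) / (link_gbps * 1e9 * efficiency)

weights = 70e9 * 1  # assumed 70B-parameter model at 1 byte/param (FP8)
for gbps in (100, 400, 800):
    print(f"{gbps:>4} GbE: {transfer_seconds(weights, gbps):6.2f} s to ship the weights")
```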

David Nicholson: Yeah, chasing bottlenecks is part of the fun of IT. It’s like a game of whack-a-mole, forever. Manya, what’s the target market for the XE9680 in general? Can we even think of it that way? Are we sort of waiting to see what the market demands? Are we imagining three or four of these systems in a rack, in an on-premises data center, for an enterprise client? Or is this primarily a cloud-scale play? What does it look like? Do we know yet?

Manya Rastogi: It’s the AI era, we all know that. So we’re targeting all customers, both enterprise and cloud, depending on their use cases and what they want to do. I believe it’s not limited to one sort of customer; the XE9680 has the capability. You can scale up, be on-prem, or it can be in the cloud. But one specific point I want to mention: with all the supply chain issues that are happening, this is a good opportunity for customers to go ahead with Gaudi 3 and start some development work.

David Nicholson: Yeah, that’s a good point.

Manya Rastogi: So that also brings it into perspective that, okay, if they want to start something on-prem right now, they can go ahead and do that.

David Nicholson: And what about this idea that if 80% of infrastructure spend globally is on training now, at some point that’s going to flip in the direction of inference? And again, don’t quote me on those exact stats, but the general feeling is that we’re doing a lot of training now and we’re going to be doing a lot more inferencing later. Do these systems land on premises? Is this confirmation of hybridity?

Steen Graham: Yeah. Well, first, I think your point is very valid. It’s hard to tell when training and inference are going to flip, but what I will say is, as more AI applications get adopted, there’s definitely more inference workload, and cost does matter. You’ve seen the leading AI research labs move to smaller models as well to drive that affordability; once they get massive scale, they want to drive affordability. The other thing that’s going on, too, is when you implement more of a chain-of-thought reasoning structure, which we’re actually demoing live right now on Gaudi 3 below, and you give the agent access to think and access to a few APIs, that dramatically increases the amount of inference that’s going on, because it’s no longer a human in a chatbot driving the inference.

This agent is actually having to think independently and reason through steps, and that dramatically increases the amount of inference. And I think a lot of people are going to have the tough conversation of: where do I want my IP to sit? Where do I want my critical thinking to sit? And what is the best TCO story? There’s a lot of value in the cloud, hands off. And then, if you think you’ve got proprietary IP, proprietary workflows, and a good TCO story, maybe you want to go on-prem, on both the fine-tuning and the inference components as well.
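To make Steen’s point concrete, here is a small sketch of how a single user question can fan out into many inference calls once an agent is allowed to plan, reason, and call tools. Everything here is a hypothetical stand-in, not the demo running on the show floor:

```python
# Sketch of why agentic chain-of-thought multiplies inference volume:
# one question triggers a plan, several reasoning steps, tool-result
# interpretation, and a final synthesis, each of which is an inference.

def llm(prompt: str) -> str:
    """Stand-in for one inference call to the served model."""
    llm.calls += 1
    return f"thought about: {prompt[:30]}"
llm.calls = 0

def call_api(name: str) -> str:
    """Stand-in for a tool/API the agent is allowed to use."""
    return f"{name} result"

def agent(question: str, reasoning_steps: int = 4,
          tools: tuple = ("crm", "billing")) -> str:
    plan = llm(f"Plan steps for: {question}")                    # 1 inference
    for step in range(reasoning_steps):                          # N inferences
        llm(f"Step {step} of plan: {plan}")
    evidence = [llm(f"Interpret {call_api(t)}") for t in tools]  # 1 per tool
    return llm(f"Final answer from {evidence}")                  # 1 inference

agent("Why was I billed twice?")
print(f"Chatbot: 1 inference per turn; agent: {llm.calls} inferences for one question")
```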

David Nicholson: Manya, final minute. Any other thoughts on what you’ve seen so far with Gaudi?

Manya Rastogi: I’ll say it’s just the beginning, so keep an eye out. Like I said, it will be available to all customers in December; we’re targeting around that time frame. And then Gaudi 3 will also be coming in the PCIe form factor around the middle of next year, with the XE offerings and also with Intel Granite Rapids, the next-gen CPUs. So it’s a good story from the Intel perspective: we have both a CPU and an accelerator from their side. So yeah, keep an eye out for that server.

David Nicholson: Fantastic. Thanks to both of you. I don’t think it can be overstated how important it is that Dell Technologies as an organization is supporting choice moving forward in this marketplace. All sorts of pressure comes to bear from a variety of directions in these industries, and Dell has always held firm on the idea that customer choice is first and foremost, and they’re continuing to do that. Great discussion about Intel Gaudi 3. Thanks to both of you for Six Five On The Road. I’m Dave Nicholson. Stay tuned for more content.
