Navigating AI Networking Challenges with Dell's Design Services
Matt Liebowitz, Global Portfolio Lead at Dell Technologies, shares his insights on overcoming AI networking challenges.
Networking is NOT an afterthought in AI. Get it wrong, and your AI initiatives could face serious bottlenecks and delays. Host David Nicholson is joined by Dell Technologies' Matt Liebowitz, Portfolio Lead, Multicloud Professional Services, on this episode of Six Five On The Road at SC24 for a conversation on Dell's solutions to the challenges of AI networking.
Their discussion covers:
- The unique challenges in networking for AI deployments and the necessity for redesigning traditional data centers
- How AI workloads differ fundamentally from traditional ones, emphasizing the significance of GPU connectivity and efficiency
- Dell's consultative approach to aiding customers through Design Services for AI Networking, highlighting partnerships and solutions tailored for AI networking complexity
Learn more at Dell Technologies.
David Nicholson:
Welcome to Six Five On The Road’s continuing coverage of Supercomputing 2024. I’m Dave Nicholson with Six Five Media and The Futurum Group, and I’ve got a very special guest to talk about networking for AI. Matt Liebowitz from Dell Technologies. Matt, welcome to the program. How are you?
Matt Liebowitz:
I’m good. Thanks for having me.
David Nicholson:
So, let’s straight away talk about networking in the era of AI. We talk about artificial intelligence coming out of Supercomputing 2024. Of course, we’re talking about high-performance computing, and supercomputing, by definition, includes computation outside of a single server node, right?
Matt Liebowitz:
That’s right.
David Nicholson:
So, by definition, networking is part of this. There was a time when people said the network is the computer. I think for the sake of this conversation, we can think of it that way, but what are you seeing in terms of customer challenges around networking today from the Dell perspective, specifically around AI?
Matt Liebowitz:
The big thing is that networking for AI is, you mentioned it actually, pretty fundamentally different from the way an enterprise data center would handle networking, where, just like you described, you don’t just have one server doing the work. You have dozens to hundreds of servers all clustered together, with multiple GPUs in each server that have to handle the inflows of data, these large, what they call elephant flows of data, and they have to be able to respond quickly, with low latency, and handle the high bandwidth. And for customers, especially the large service providers and the large enterprises, this is just not something that they do every day. It’s a networking technology and framework that they just don’t work with that often, and so many of them are struggling to make sure they can deliver the performance that AI needs to train models, and do inferencing, and ultimately deliver value.
David Nicholson:
So let’s double click on that just a little bit. When you say it’s different, can you give me more specific examples of the differences? We’ve been tying together a bunch of general-purpose compute servers with CPUs for 20, 30 years. How do AI and all of these GPU clusters meaningfully change the equation?
Matt Liebowitz:
I would say the big two… Or really, we’ll start with the first one, which is the change in network technology. Most enterprise organizations are not familiar with working with InfiniBand technology, for example. If they’re already doing high-performance computing, high-frequency trading or drug research or something like that, they might be familiar with InfiniBand, but most are not. It works differently. The cabling is different. The protocols that it uses are different, so it’s a new language they need to learn. And even if they want to use Ethernet, you’re using 800-gigabit Ethernet with specialized switches and using RDMA over Converged Ethernet to have direct memory access between these cluster nodes. Because, again, it’s like you said, it’s not just one server. You’re not just tying one compute node together with another. You’ve got, say, a Dell XE9680 server with eight GPUs. Let’s say you have dozens to hundreds of them in your data center. You have to cluster them all together so that from the application’s perspective, that’s one big server, but behind the scenes, all the data has to flow between those nodes so that it remains performant and can deliver on the application performance that the developers and consumers are looking for, so it’s a pretty big sea change in how networking is handled in a data center.
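To make that “one big server” idea concrete, here is a minimal sketch of our own (assuming a PyTorch training job; none of this is Dell-specific code): the application only asks for the NCCL backend, and NCCL negotiates the underlying transport, InfiniBand or RoCE, beneath it.

```python
# Minimal PyTorch distributed setup: the application sees "one big server",
# while NCCL selects the underlying fabric (InfiniBand verbs, RoCE, or a
# TCP fallback) without the training code changing.
# Hypothetical launch, one process per GPU on each node:
#   torchrun --nnodes=16 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 train.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # reads env vars set by torchrun
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # A gradient all-reduce: logically one operation on "one big server",
    # physically an elephant flow crossing every node in the cluster.
    grad = torch.ones(1024 * 1024, device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The point of the sketch is that the fabric choice is invisible at this layer, which is exactly why it has to be made correctly in the physical design: the application cannot route around a network that was specified wrong.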
David Nicholson:
So, you just brought up a great thing to follow up on. You talk about InfiniBand, and you talk about Ethernet. Dell has positioned itself pretty well in the market as the Switzerland of AI solutions. In other words, “Hey, oh, you want a blue one? Yeah, we have that. Oh, no, no, you want yellow. We have that too.” But from a services standpoint, how do you engage customers and help them make those decisions? I imagine that folks, especially if you’re talking about putting together hybrid cloud or on-premises solutions, a lot of them aren’t just buying one XE9680 with eight GPUs in it. They might be buying a cluster; therefore, those servers must be networked with a new networking technology, and they have the question about InfiniBand versus Ethernet. What does it look like? What does the process look like when you work with customers and help them figure that out?
Matt Liebowitz:
Yeah, so great question. We try to get in front of it before they make a large purchase because the networking is so critical to this. So we’ve launched a series of design services for AI networking to try to get in front of that. Help the customers… Well, first understand what they are trying to do. What’s the use case with AI that they’re pursuing? Are they just doing retrieval-augmented generation, are they doing inferencing, or are they actually training a large-scale model? All of that will influence the networking decisions, especially how big the clusters are and how they’re going to be interconnected. So those design services start with figuring out what they’re going to do and what their plans are, and then we deliver an actual design that ultimately results in a bill of materials that says, “You want a blue one? You want a green one? This is what you actually should get if the outcome you’re looking for is X.” And so we can deliver that to them before they’ve made a purchase decision so that when they get all that equipment in place, they’ve got the right systems, and it’s going to perform the way that they need it to.
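To see why cluster size drives that bill of materials so directly, here is a back-of-envelope sizing sketch. Every parameter in it (eight GPUs per server, one NIC port per GPU, 64-port switches, a two-tier non-blocking fabric) is an illustrative assumption, not a figure from Dell’s design services.

```python
# Rough two-tier (leaf/spine) fabric sizing of the kind a design exercise
# starts from. All parameters are illustrative assumptions.

def size_gpu_fabric(num_servers: int,
                    gpus_per_server: int = 8,   # e.g., a Dell XE9680
                    nics_per_gpu: int = 1,      # one fabric port per GPU
                    switch_ports: int = 64):    # radix of each switch
    """Estimate leaf/spine switch counts for a non-blocking GPU fabric."""
    gpu_ports = num_servers * gpus_per_server * nics_per_gpu
    # Non-blocking: half of each leaf's ports face servers, half face spines.
    leaf_down = switch_ports // 2
    leaves = -(-gpu_ports // leaf_down)            # ceiling division
    uplinks = leaves * (switch_ports - leaf_down)  # total leaf-to-spine links
    spines = -(-uplinks // switch_ports)
    return {"gpu_ports": gpu_ports, "leaves": leaves, "spines": spines}

# "Dozens to hundreds" of 8-GPU servers changes the answer quickly:
for n in (16, 64, 128):
    print(n, "servers ->", size_gpu_fabric(n))
```

Under these assumptions, 16 servers need roughly 4 leaf and 2 spine switches, while 128 servers need 32 and 16, which is why the design work has to happen before the purchase order, not after.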
David Nicholson:
But to be clear, the only dog you have in the hunt is the Dell dog. As far as you’re concerned, you can create the solution based on Ethernet or InfiniBand, and the Ethernet, by the way, could come from Dell native Ethernet solutions or packaged Ethernet from others. You guys are agnostic, or as agnostic as somebody can be in that space, right?
Matt Liebowitz:
Yeah. So Dell has our own solutions for Ethernet, and we work with partners like NVIDIA to deliver, and we actually work very closely with NVIDIA to deliver our services on top of their equipment too. So, if they’re buying NVIDIA Quantum for InfiniBand or if they’re buying Spectrum-X for 800 gigabit Ethernet, we can support that as well. But as you said, we have our own equipment too, with our PowerSwitch line, that we can help them with.
David Nicholson:
Got it. Got it. Okay. And so it’s interesting when you talk about going in and discovering what the use case is, I imagine, well, I know for a fact that some of those conversations today start with you being consultative and asking the question, what is the use case we’re trying to solve here? And frankly, a lot of CIOs and CTOs are looking back across the table at you and saying, “You tell me, Matt. We’re trying to figure… My board just asked my CEO, who then immediately called me and asked, ‘How am I going to get ROI, positive ROI out of this AI thing?'” How much does networking choice factor into that quest for positive ROI out of AI? It’s got to be a meaningful cost contributor at some point when you’re building these things out.
Matt Liebowitz:
It’s not small. It’s all about time to value. I think we saw, and we still see with our backlog, a mad scramble to acquire this equipment. And to your point, they’re sometimes acquiring it without a use case in mind. That does happen. The networking plays such a key role because, again, it really depends on what they’re going to do. If they’re looking to start small with a small proof of concept, a handful of servers, the networking they choose, while important, is less impactful. For the large-scale enterprises or the ones that are what we would consider cloud service providers, or CSPs, that are buying hundreds of these things, the networking is critical. If you get it wrong, the time-to-value will increase significantly, it will take longer to train models, the users, the consumers of the AI, will have performance issues, and they just won’t get the value out of it. So that’s why, when they say, “Well, what should we do with this? How can you help us?”, we have services to help them with use case definition and figure out what they want to do with it, but we have to start with the infrastructure. If we don’t get the networking right, the whole thing’s not going to perform the way it should.
David Nicholson:
We glossed past some things. Those of us in this business will toss out the numbers, the hero numbers, for things that are going on. But just to put this in perspective, I get excited when I walk into Home Depot and see that you can get a 10/100 switch for 20 bucks that gives me all the bandwidth I need at home. So repeat what those state-of-the-art specs are that we’re dealing with from an InfiniBand and/or Ethernet perspective, and then what is publicly acknowledged as the next thing that’s coming? Where are we in that progression?
Matt Liebowitz:
We’ll have to talk about you buying networking equipment at Home Depot, but that’s for a later conversation. Listen, the state-of-the-art is 800 gigabit, whether we’re talking about InfiniBand or Ethernet. What’s coming next? Even at Dell Tech World, Broadcom was on stage talking about how 1.6 and 3.2 terabit are not far away. We’re not talking about 20 years into the future. We’re talking a handful of years into the future. But state-of-the-art today is 800 gigabit. Many customers that are going with AI today are choosing InfiniBand; that’s the standard. But as you see the rise of 800 gig Ethernet and of technologies like RDMA over Converged Ethernet, so you can have that high-performance memory-to-memory transfer between these cluster nodes but do it over Ethernet, I think you’re going to see that grow more. The other thing I’ll say is most organizations are pretty familiar with Ethernet, so their comfort zone is likely to be, “I want to stick with the thing I know, which is Ethernet.” So I think we’re going to see that. We’re at 800 gigabit today, but again, a couple of years into the future, 1.6 terabit is not too far away.
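To put those line rates in perspective, a quick bit of arithmetic of our own (the 1 TB payload and the zero-overhead assumption are both illustrative):

```python
# Time to move a 1 TB payload (say, a large model checkpoint) over a single
# link at each line rate, ignoring protocol overhead. Illustrative only.
TB = 1e12  # bytes

for name, gigabits in [("100GbE (typical enterprise)", 100),
                       ("800G (state of the art)", 800),
                       ("1.6T (next step)", 1600)]:
    seconds = TB / (gigabits / 8 * 1e9)  # gigabits/s -> bytes/s
    print(f"{name}: {seconds:.0f} s per terabyte")
```

That works out to about 80 seconds at 100 gigabit, 10 seconds at 800 gigabit, and 5 seconds at 1.6 terabit, which is the scale of difference these clusters are buying.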
David Nicholson:
Terabit.
Matt Liebowitz:
Well, don’t buy it at Home Depot though.
David Nicholson:
Terabit, say it with me. Isn’t that incredible?
Matt Liebowitz:
Listen, I remember when one megabit was fast. Look at-
David Nicholson:
Yeah, exactly.
Matt Liebowitz:
Right.
David Nicholson:
Exactly.
Matt Liebowitz:
Remember when one megabit was fast, and then 10 megabit was fast, and then a hundred megabit was Fast Ethernet? It doesn’t go down.
David Nicholson:
Exactly. And just to be clear, I only have hair in the front.
Matt Liebowitz:
I have none anyway.
David Nicholson:
I’m right there with you. And if you ever get a chance to see those Tomahawk chips, which I think are the Broadcom ones in particular, you look at the little nubs that that 800 gig traffic is going through, and it’s mind-boggling. It’s entering the quantum realm. But let’s get back to this idea of services. We know Dell as a company that provides all of the… Whatever the generic word for Lego bricks might be. All of those building blocks that you can assemble in a variety of different ways to create rack-scale solutions for people, solutions that include the networking. But you talked a little bit about the services that are involved, going in and doing the original consultation. Maybe walk me through that in a little more depth. What do we get from Dell Services? Why wouldn’t I just want to buy Dell hardware, and then find somebody else to do the services for me? It’s a silly question, but…
Matt Liebowitz:
Yeah. Well, look, our services are differentiated in that way, so let me walk through how we do it. We start with our design services for AI networking, which give them a framework and a design for what they need based on their use cases. And then we move forward when they purchase the equipment, not just the servers, but the networking and the interconnects. We can help them with the actual physical installation of that equipment in their data centers, the cabling. I don’t know if you’ve ever been in one of these large-scale AI data centers. It is a joy to see the bundled cables as they move between racks and above racks. It’s actually quite a sight, but all of that takes planning. The cables can’t be bent beyond a certain radius or you’re going to have performance problems, so you need to have that installed professionally by people who know what they’re doing, and we can help them with all of that.
And then it’s not just getting it installed. Do we have blinky lights? That’s not enough. We do deep testing, obviously, one, to just make sure nothing’s going to fail. If equipment’s going to fail, it’s going to fail during stress tests at the very beginning. But it’s more than that. It’s the performance. We want to make sure your GPU-scale networking is going to perform the way we expect it to. And so we run benchmarks in concert with NVIDIA’s engineering team; we’ve worked with them on these networking benchmarks. So we can see from the very beginning, before this gets deployed into production, is it going to be performant? Is it going to meet the use cases that the customer wants? And then, ultimately, once all of that is done, we can hand it over to them. We want to do more than just help them buy the equipment. We want to help them design it, get it installed, tested, and then pushed into production so they can start getting value quickly.
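The exact benchmark suite Dell runs with NVIDIA isn’t spelled out here; the open-source nccl-tests are the usual public reference for this kind of fabric validation, and the core measurement can be sketched in a few lines of PyTorch (an illustration, not the actual acceptance test):

```python
# Crude fabric health check in the spirit of NVIDIA's nccl-tests: time a
# large all_reduce and report bus bandwidth. Real validation sweeps message
# sizes and collectives; this shows only the core measurement.
# Hypothetical launch: torchrun --nnodes=<N> --nproc_per_node=8 check.py
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

nbytes = 1 << 30  # 1 GiB payload per rank
x = torch.empty(nbytes // 4, dtype=torch.float32, device="cuda")

for _ in range(5):            # warm-up so NCCL settles on its algorithm
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

# Standard ring all-reduce accounting: each rank moves 2*(n-1)/n of the data.
busbw = (2 * (world - 1) / world) * nbytes / elapsed / 1e9
if rank == 0:
    print(f"all_reduce bus bandwidth ~ {busbw:.1f} GB/s across {world} GPUs")

dist.destroy_process_group()
```

Bandwidth far below the fabric’s rated numbers at this stage points to exactly the kind of cabling, configuration, or congestion problems that are cheapest to find before production.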
David Nicholson:
I think it’s pretty well accepted that the leading edge of AI technology is straining what’s possible with air-only cooling. So a lot of solutions moving forward are direct liquid cooled. There are even examples of bespoke data centers where they’re immersing boards in oil, but the standard that’s emerging seems to be a water-glycol mix for direct liquid cooling. We think about that for the CPUs or the GPUs, but your equipment is living in those racks too. So what are the things you have to think about, or are there any considerations, when it comes to heat dissipation and thermal density and all of that from the networking engineer’s perspective?
Matt Liebowitz:
Yeah, so liquid cooling is the future for AI, regardless, because these things are not going to become less power-hungry in the future; they’ll only become more so. In terms of networking, you laugh, but we’ve seen on some early engagements that just running the cables the wrong way can make it so the air can’t dissipate properly. You get overheating inside the rack. The heat rises up, and guess what’s at the top of that rack? It’s your networking switch. The networking switch overheats, and so it’s those little details: making sure the cables are routed correctly, the right type of cables are in use, you have the proper airflow inside the rack and in between the racks, and you have the breakdown of the hot and the cold aisles. All of that is really important. Even if you’re using liquid cooling, all of that is still important.
David Nicholson:
A lot of folks agree that at this stage of development in AI we’ve transcended the GPU-CPU era, frankly, and we’re really in the connectivity era, because connectivity plays such a critical role in tying these devices together. If these devices aren’t connected together, they are worthless. The folks that build these devices might disagree with me. Definitely, the folks that provide networking and networking services, like Matt and his team at Dell Technologies, I think they would nod their heads. But it is obviously critically important to get the network done correctly at the start because as you add, and build out, and grow your clusters, the network can potentially be your bottleneck. Thanks for joining us here. I’m Dave Nicholson with Six Five On The Road. Stay tuned for continuing coverage from Supercomputing 2024.