The Future of Enterprise AI Connectivity
In the realm of advancing AI, having a cohesive edge-to-cloud AI network infrastructure is crucial. Ethernet, based on industry-standard protocols, is rapidly emerging as the preferred networking technology for cost-effective and flexible AI solutions, outperforming today’s proprietary alternatives. In this session, Intel will share its vision for how AI network infrastructure needs to evolve to meet the growing demands AI brings to networks, and the importance of building this evolution on open standards, such as those developed by the Ultra Ethernet Consortium. Intel will also discuss its portfolio of Ethernet-based products for AI available today, as well as hint at its plans for the future.
Takeaways:
- AI is advancing fast, requiring innovations in AI network infrastructure to meet future AI demands
- Industry standards, such as Ethernet and UEC, enable an open ecosystem and vendor choice when building AI infrastructure
- Intel has a broad, growing portfolio of AI networking products, all based on Ethernet
Transcript
Will Townsend:
Hey, I want to welcome all of our viewers to The Six Five Summit. It’s AI Unleashed, and we’re going to talk about AI as it relates to networking. This is the connected intelligent edge track. But before we get started, I want to provide a little bit of context. So in the realm of advancing AI, there’s a need for robust connective tissue, and that is networking. Ethernet is emerging as a viable alternative to proprietary interconnect architectures. And in this session we’re going to spend time with Intel. The company is going to share its vision around AI networking infrastructure and what it’s doing around supporting open standards, and we’ll touch on the Ultra Ethernet Consortium as well. So with that, I’ve got Thomas joining. Thomas, welcome.
Thomas Scheibe:
Hey, thanks for having me. Really, really looking forward to this.
Will Townsend:
I am as well. And so let’s kind of set the context broadly. Certainly, the needs of AI connectivity are much different than those of the data center and the traditional enterprise network. So I’d love it, Thomas, if you could spend some time on how Intel sees that.
Thomas Scheibe:
Yeah, no, thanks for starting us off with this general look at where AI fits in with current networking. As some viewers know, networking has been my DNA for the last 20 years. So yeah, we’re looking at AI connectivity. Quite frankly, the way I see it, it’s an addition to what networks do today, and you’re going to hear me say more about this. But what really sets it apart from a traditional enterprise environment is you have these very expensive GPUs, not just CPUs but also GPU-based servers for AI. And you have a need for reliable transport. There are different ways to do that, but you really need reliable transport to minimize the waste of expensive compute cycles, whether CPU or GPU, and actually get the best outcome. And as we all know, these products are not cheap.
If you want to get to the TCO, you need to use them as much as you can, which means you need a reliable networking infrastructure. So that’s, I think, where it starts. And then the next one: when enterprises look at this, they’re saying, “Hey, I have Ethernet everywhere today.” They love the operational simplicity. Ideally, you do not have dedicated fabrics; you run everything on Ethernet. And so now the next question comes: how do I make that happen? How do I advance what I have today with Ethernet and make it reliable? There are a couple of ways to think about it, and that gets us a little bit into where the industry is going and where we at Intel are driving, together with partners in the ecosystem. The first way, which is what most today actually do, pretty much everywhere in the Ethernet world, is you say, “Hey, to get to reliable transport, I make the fabric lossless at the Ethernet level,” which is really RDMA over Converged Ethernet.
Most people just call it RoCE. And that’s basically what it is. It’s proven, but it takes settings and optimization in the Ethernet fabric to make that happen. And then the second way is to say, hey, I just assume my Ethernet fabric is what it is today for pretty much every application, a lossy fabric, and then I implement a protocol at the edge, on the server, on the NIC. And this is where the Intel IPU comes in, with a protocol such as Falcon, which was open sourced by Google, to get to the same outcome, which is reliable transport over an Ethernet fabric. And so, coming back to where we opened up: what’s different is you want reliable transport for AI workloads, and you ideally want to do this on Ethernet because customers just love the operational simplicity of Ethernet. So let me stop there. Hope I didn’t go too deep, but I’m sure you have more questions.
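To make that second approach concrete, here is a toy Python sketch of reliability handled at the edge over a deliberately lossy fabric. It is a sketch of the general idea only, with made-up loss rates and packet counts; it does not model Falcon or the Intel IPU implementation.

```python
import random

def fabric_pass(packets, loss_rate, rng):
    """One trip across a lossy Ethernet fabric: each packet survives
    independently with probability (1 - loss_rate)."""
    return {seq for seq in packets if rng.random() > loss_rate}

def endpoint_reliable_transfer(n_packets, loss_rate, seed=0):
    """Reliability implemented at the edge, on the NIC: track what is still
    outstanding and selectively retransmit only that, round after round."""
    rng = random.Random(seed)
    outstanding = set(range(n_packets))
    rounds = 0
    while outstanding:
        rounds += 1
        outstanding -= fabric_pass(outstanding, loss_rate, rng)
    return rounds

if __name__ == "__main__":
    # At a 1% drop rate, 10,000 packets all land within a few rounds, and only
    # the ~100 dropped ones are ever resent: the fabric itself stays lossy.
    print(endpoint_reliable_transfer(10_000, loss_rate=0.01))
```

The same outcome, reliable delivery, is reached without the fabric ever being lossless; the RoCE approach instead pushes that guarantee down into the switches.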
Will Townsend:
No, no, no, that’s perfect. And from my perspective, Ethernet has been around for decades, Thomas, and you spoke to that ease of management and the fact that it’s been around for a very long time. But there are also some other advantages relative to other proprietary architectures, for example, it’s more open, there’s more flexibility. Any other light that you can shed in that regard from sort of a total cost of ownership perspective?
Thomas Scheibe:
Yeah. And that’s a good point, and I don’t want to come across as saying, “Hey, Ethernet as a technology is always the right answer.” It’s not just a technology answer, as I kind of alluded to. I think what customers are looking at is really a couple of items. They do want a standards-based solution. They do want an ecosystem, because they do want the choice of various vendors in this market. And it’s not just because they like more than one vendor; it’s that if you have multiple vendors, you just have better risk diversification. We all remember the supply chain situations that we had. Some still have them, where you wait months and months and months to get products. So having choice is a good thing. It also drives more innovation faster. I think that’s another reason why customers look for choice. And then, yeah, at the end you do want a technology that does the job, which gets us a little bit into where the standards are going with Ethernet.
Because Ethernet has been around, to your point, forever, it’s by far the broadest installed base there is. And calling out some of the other ones, and I have 20-plus years in this industry: I remember initially, not in the data center world, we were talking about ATM and SONET; all of those became Ethernet and IP. Then you had Fibre Channel; all of this moved onto IP and Ethernet. And then you have InfiniBand, which was always in that corner of the HPC, the high-performance computing world. And so this is the latest discussion: “Hey, can we make this work on Ethernet?” We basically have the volume, which drives down costs and drives innovation much faster with Ethernet. And so really, the whole discussion is how we can use Ethernet to get to reliable transport similar to what InfiniBand can do today in a very specific niche. That’s really what customers are pushing for and what the industry is stepping up to deliver.
Will Townsend:
I totally agree, and I love your comment about competition breeds innovation. I use that all the time. When you look at Ethernet, I mean certainly Intel is very focused there, but your competitors are as well. And so you’re seeing Marvell and Broadcom invest in Ethernet and address some of the concerns around performance that have existed in comparisons to InfiniBand and that sort of thing. In the debate around open, I mean Intel is very open standards oriented with x86 and what you’re doing from a networking perspective. But can you speak to some of the other advantages of open standards?
Thomas Scheibe:
Yeah, maybe a couple of points, because you mentioned the ecosystem. I basically look at this: if we’re looking at AI, and I don’t like to use the term explosion, but the tremendous growth that comes out of it just from a network connectivity perspective is real. It is very clear. And this is going to go on for a couple of years, at least. There’s no doubt in my mind. And I don’t think this is about whether you’re first or whether you’re the only one. The market is so big, there’s plenty of room to grow for everyone. So that’s why I actually look at this as an opportunity. And I think most of my peers, and you mentioned some of them, look at it the same way: this is a great opportunity to grow and deliver value for our customers.
So I think there’s growth for everyone. That’s number one. Number two, in terms of performance, there’s this very technical debate going on around InfiniBand, and I pick InfiniBand because the InfiniBand-to-Ethernet transition is going to happen in what most people call the scale-out network. To put a note on that: this is basically how you connect GPU servers to other GPU servers. That’s the scale-out network, very high bandwidth between server nodes or racks, depending on what GPU server design you’re on. And realistically, as I said, what it really comes down to, to get to reliable transport: you want to minimize latency, you want to minimize the time it takes to do retransmission.
As I kind of alluded to, there are two ways to do this. One is a very expensive way, which is what InfiniBand does: you really enforce losslessness. However, that takes away some of the ability to scale out to very large scale. The other is where you say, “Hey, I can take some loss, but I do very fast retransmit, selective retransmit, I can react very fast to congestion and route around it or steer traffic around it, and I get to reliable transport that hits my performance criteria. Now I can scale out to very, very large fabrics,” which is where the trend is going. And so this is really where Ethernet is going to shine from a technology perspective. And quite frankly, based on some of the initial data that we’re seeing, it will actually do the same as, and better than, what InfiniBand can do today from a scale perspective.
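As a rough illustration of why fast selective retransmit scales better than resending whole windows, here is a toy Python comparison counting the packets each discipline puts on the wire. The window size and loss rate are illustrative assumptions; this models neither InfiniBand nor the UEC transport in any detail.

```python
import random

def count_transmissions(n_packets, loss_rate, mode, window=64, seed=1):
    """Toy count of wire transmissions needed to deliver n_packets reliably.

    mode="selective": only packets that were actually dropped are resent.
    mode="go_back_n": the first drop in a window throws away everything
    after it, so all of it must be sent again.
    """
    rng = random.Random(seed)
    sent = 0
    next_needed = 0    # lowest sequence number not yet delivered in order
    delivered = set()  # received, waiting for in-order delivery
    while next_needed < n_packets:
        window_end = min(next_needed + window, n_packets)
        for seq in range(next_needed, window_end):
            if mode == "selective" and seq in delivered:
                continue   # already received; no need to resend
            sent += 1
            if rng.random() >= loss_rate:
                delivered.add(seq)
            elif mode == "go_back_n":
                break      # receiver discards everything after the drop
        while next_needed in delivered:
            delivered.discard(next_needed)
            next_needed += 1
    return sent

if __name__ == "__main__":
    for mode in ("selective", "go_back_n"):
        print(mode, count_transmissions(100_000, loss_rate=0.01, mode=mode))
```

Even at a modest loss rate, the selective strategy sends close to the theoretical minimum, while the coarse strategy’s retransmission cost grows with the window, which is the intuition behind trading strict losslessness for fast, surgical recovery at scale.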
Will Townsend:
And from my perspective, a lot of the innovation is being driven through the Ultra Ethernet Consortium. And I know that is something that Intel is very involved in. And I’m wondering if you could spend a little bit of time for our viewers and explain what are the goals and how is Intel specifically contributing to the UEC?
Thomas Scheibe:
Yeah, the UEC, I want to say, probably started over 12 months ago, when the awakening happened. Quite frankly, ChatGPT was the seminal moment when people realized, man, it is so easy to use this technology, now let’s figure out where we can use it. Everybody said, oh, I can use it everywhere, which is true. So Intel is a founding member of the UEC because we looked at this and, to your point earlier, we’re a big proponent of open standards and enabling an ecosystem, not only on the Xeon side but on connectivity as well.
And so the mission, really, if you look at the UEC, is an Ethernet-based, open, interoperable, high-performance, full-communication-stack architecture to meet the demands not just of what HPC did in the past but of what AI needs now. And so that’s where the focus is. If you look at it, there are different knobs you can turn there: layer one working groups, layer two, layer three, and then, as you go up the stack, the fabric layer. What are some of the optimizations to actually deliver truly reliable, optimized, low-latency transport? And so Intel is participating in these various working groups, besides being on the steering committee as well. Quite frankly, if you look at the membership, the whole industry is coalescing around it. I think at this point pretty much everybody that plays in that space is part of the UEC and is pulling together.
Will Townsend:
That’s great. Let’s dive a little deeper into Intel’s portfolio. Can you provide some examples of what you’re doing, to your point, to drive AI connectivity at scale-out speed?
Thomas Scheibe:
Yeah. And you bring up a good point, because there’s a little bit of, whenever you have, it’s not a hype cycle, whenever you have these initial runs, people focus on the high end. And the high end is real, so don’t get me wrong: I need the next 16,000, 32,000, 48,000-node large language model clusters. These are real, these are happening. Quite frankly, what’s really interesting to me is the build-out cycle. I’m used to building another version every three years; in the AI world, we’re in a 12-month build-out cycle on the high end, which is super interesting. Again, it comes back to this actually being an opportunity for the whole industry. Not just Intel, the whole industry.
Will Townsend:
Well, and these GPU clusters are massive. I spent time at GTC, I was at Dell Technologies World this past week, and it’s just mind-blowing. And you’ve got to have the right level of connectivity to make the plumbing all work.
Thomas Scheibe:
Yeah. And where we’re going with this sometimes gets a little bit lost when you’re in that initial stage, when a market’s just booming. Everybody talks about what they need on the high end because it’s the next biggest thing, and that’s important, as I was saying. That’s really where you need reliable transport and optimization to get the best performance out of it, because, again, these GPUs are not cheap today, partly, the way I think about it, because they’re really general purpose. And one of the trends you will see, and I’m getting a little off track here, but there are a lot of indications out there: over the years you will have a lot of purpose-built GPUs, what some customers or services call TPUs or accelerators. But that will come over time.
But today you have these general-purpose, very expensive GPUs, and you want to run them at optimal utilization because time is money. So that’s one thing. This is where you want true, reliable transport, and this is one of the things we have implemented with the Falcon technology, and this is where the UEC comes in as well; we will see more optimization in that space. However, there are also a lot of smaller clusters. Think about 1,000, maybe 2,000 GPU nodes. Those you can actually do today. And I have seen this with enterprise customers, as well as some of the CSPs today: they’re deploying RoCE v2. It takes a lot of tuning, but for smaller clusters you can do it. And in this world you would use Ethernet adapters, and in the Intel portfolio you would use the foundational Ethernet controller, the 800 series. Or you can use the Intel IPU adapter, which also supports RoCE v2.
And then for a larger-scale cluster, you can use Falcon as the protocol to get a better reliable-transport solution. The main point I’m trying to get across is: if you’re an enterprise starting today, you take your existing Ethernet fabric and you have two choices. You can turn on RoCE v2 for the nodes where you need lossless Ethernet; to support medium-sized clusters, that works perfectly fine.
If you want better performance, you would use an IPU NIC. You don’t actually have to do anything on the network at that point. You don’t have to turn on the lossless features; you just use a lossy fabric and run Falcon between the endpoints using the NIC. And then the very large cloud providers will go down the path of what the UEC is working on: a truly, fully optimized Ethernet solution for reliable transport.
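To summarize the deployment guidance above, here is a small sketch that encodes Thomas’s decision tree as stated in this conversation. The 2,000-node threshold and the function and option names are illustrative assumptions, not Intel sizing guidance.

```python
def pick_transport(gpu_nodes: int, can_tune_fabric: bool) -> str:
    """Toy decision helper reflecting the options described above."""
    if gpu_nodes <= 2_000 and can_tune_fabric:
        # Small/medium cluster, and you are willing to tune lossless Ethernet:
        # RoCE v2 works today, e.g. on an Intel Ethernet 800 series controller
        # or an Intel IPU adapter.
        return "RoCE v2 on a tuned, lossless Ethernet fabric"
    if gpu_nodes <= 2_000:
        # Same scale, but the fabric stays lossy and untouched: run a reliable
        # protocol such as Falcon between the endpoints, on the IPU NIC.
        return "Falcon between endpoints over a lossy fabric (IPU NIC)"
    # Very large builds: the fully optimized, standards-based transport the
    # Ultra Ethernet Consortium is defining.
    return "UEC-optimized Ethernet transport"

if __name__ == "__main__":
    print(pick_transport(1_000, can_tune_fabric=True))
    print(pick_transport(1_000, can_tune_fabric=False))
    print(pick_transport(32_000, can_tune_fabric=True))
```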
Will Townsend:
And I’d love to touch on what you’re doing from a software perspective as well, because obviously you’re a leader in silicon, but software comes into play. And you’ve touched on this a little bit, but I know that there’s oneAPI and then there’s the Intel Ethernet Fabric Suite. Can you spend a little bit of time on those points as well?
Thomas Scheibe:
And that’s a very important piece, because, particularly for people like me, and I’ll admit my history coming out of the hardware silicon world, it’s so much fun talking about performance and what you can do. But unless you have the right software abstraction layer and make it easy to consume, quite frankly, a lot of this doesn’t matter. And so what is really interesting in this space: at Intel we have this product called the Intel Ethernet Fabric Suite, which actually was built around the HPC, high-performance computing world, and is very transferable to what is needed in the AI world. And then obviously we’re pushing oneAPI. And so if you look at what is really needed in this space: applications built for the AI world, more and more, are either based on something like PyTorch or riding on top of a particular vendor’s communication library.
And what you really would like to do here, to actually enable an ecosystem, is have a software layer that can plug in underneath what PyTorch does, or underneath the communication library of one particular vendor, so you don’t have to change what you do today. Have that Fabric Suite in the middle, and then underneath have the drivers plug in from a broad vendor ecosystem, not just Intel NICs but also ecosystem vendor NICs as well as switches. So basically, at that software layer, provide a high-performance open source layer that people can build on. That way we actually enable, and accelerate, the adoption of reliable transport mechanisms for AI workloads.
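Here is a minimal sketch of that plug-in idea: the application codes against one stable collective call, and vendor fabric stacks register implementations underneath it. Every name in it is hypothetical; this is not the Intel Ethernet Fabric Suite or oneAPI interface.

```python
from typing import Callable, Dict, List

Tensor = List[float]

# Hypothetical provider registry: the framework (say, a PyTorch process group)
# calls collectives through one stable interface, and each vendor's fabric
# stack registers an implementation underneath.
_PROVIDERS: Dict[str, Callable[[List[Tensor]], Tensor]] = {}

def register_provider(name: str,
                      allreduce: Callable[[List[Tensor]], Tensor]) -> None:
    _PROVIDERS[name] = allreduce

def allreduce_sum(per_rank_grads: List[Tensor], provider: str) -> Tensor:
    """Elementwise sum across ranks; the application never changes when NICs do."""
    return _PROVIDERS[provider](per_rank_grads)

# In-process reference provider. A real provider would move these buffers over
# RDMA verbs or a Falcon-style endpoint protocol on the NIC instead.
register_provider("reference", lambda grads: [sum(vals) for vals in zip(*grads)])

if __name__ == "__main__":
    grads = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]       # three simulated ranks
    print(allreduce_sum(grads, provider="reference"))  # ~[0.9, 1.2]
```

Swapping the `provider` string, rather than rewriting the training code, is the operational win the abstraction layer is after.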
Will Townsend:
No, I love what Intel is doing there, because you’re reducing the friction for deployment and adoption, and I think that’s a super positive thing. But this is challenging. You spoke to explosion; I like to use the term gold rush tied to generative AI. And I think there are still a lot of organizations that are scratching their heads. They’re taking in information from different sources, they’re hearing one thing about InfiniBand, one thing about Ethernet. But besides connectivity, Thomas, from your perspective, what else is important as enterprises consider taking this journey to next-generation AI?
Thomas Scheibe:
Yeah, no, no, good point. And by the way, I love the term gold rush. Actually I should have started there because I think-
Will Townsend:
I like to use that.
Thomas Scheibe:
Yeah, no, listen, in the end, to pull it up before I answer that question: this is really all about unlocking the value of data. The gold is the data, and how do I get to the value of that data faster using AI? That’s why I think it’s a beautiful analogy, because we really understand, hey, everybody has a lot of data; how do I actually get to the value? But anyway, you asked besides connectivity, right? So we covered that. And I think that’s the important piece, because again, GPUs are not cheap. Connectivity needs to make sure the GPU is humming and not sitting idle.
And that’s very, very important going forward, because at some point you need to be able to make this a good TCO case and a good ROI over time; there needs to be some return. And as I say, there are multiple pieces; besides the network making sure the GPUs are humming, you’re obviously going to start optimizing: do I really need the biggest GPU for everything, or are there optimizations for certain use cases? And obviously, the other thing the whole industry is nervous about longer term is that these things are very power hungry. How do I make sure, as I drive this performance, that I actually get my power consumption down? Because-
Will Townsend:
Well, that’s what I hear, Thomas. When I sit through various briefings with Intel and some of your competitors and go to events, one of the biggest concerns is around the power envelope. And everything that I’m hearing from silicon providers, including Intel, is around what you’re doing to drive a more sustainable set of silicon. That’s super important.
Thomas Scheibe:
You absolutely have to do that. And then the next piece that will come, and you see some of the articles, which I think is very obvious because it is data and you are trying to unlock the value of data: a lot of this is data that is proprietary information for enterprises. It might be customer data that you want to use in an application where you want to drive value. And I’m not talking about, hey, how can I answer a general question from the internet faster? I’m talking about, let’s pick an example, I have a customer calling in for support. I want to go through all the data I have about this customer and use an AI model to come up with a good answer much, much faster than normal. And so I think that data you need to protect. And data is needed everywhere.
It’s needed for training the model, it’s needed for answering the questions, for the inference. And so you need to build security into this AI infrastructure for the data going in, data going out, data in flight. You’re probably saying, “Yeah, Thomas, this sounds very similar to a lot of the things we’re already talking about in security and networking.” And the answer is yes. So what needs to happen here: when we build these AI clouds, or our customers build these AI clusters, you need to think through how you protect the data going in and out, how you protect access in these deployments. Whether it’s encryption for the data coming in and data going out. Whether it’s authentication, who can have access to the models. How do you scale? The other piece we haven’t even talked about is how you actually get big chunks of data into models for learning.
You don’t want to do this over old infrastructure. Ideally, you want to do this over NVMe over Fabrics over IP, to get data very efficiently in and out, but that runs over an IP network. So again, how do you make sure that you encrypt things? And so this is where I actually spend a lot of cycles with the product that I have, which is the Intel IPU: thinking through not just how to get reliable transport, but how to actually add security. And the beautiful thing about the Intel IPU is that it has an embedded CPU complex that can house a lot of the security functions inside the NIC. So when you have data coming in and out of a server, I can actually intercept it and do a lot of these things there, in terms of encrypting or authenticating, seeing who has access, reporting on who has access. And so this is a very, very important piece, particularly in the enterprise, where you’re worried about where your data is going. One more item, and I do want to call out a little bit of an advertisement here.
Will Townsend:
No problem. I’m about to make one as well when you’re done.
Thomas Scheibe:
Yeah. One last call-out, because I know you’re probably thinking, when does he stop talking? We actually work very, very closely with Red Hat, given that we have a lot of joint customers in the enterprise space, on inboxing our Intel IPU with Red Hat OpenShift and Red Hat Enterprise Linux. And inboxing really means making sure it is supported out of the box when developers are looking at using this. The reason that’s important, again, is because we basically give developers access to the CPU complex that is embedded in the IPU NIC to run their security functions on. It becomes very, very easy for them to use the standard tools they use today to also manage that little protected enclave on the IPU NIC. So back to you.
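As a host-side illustration of the kind of inline security function Thomas describes running on the IPU’s embedded CPU complex, here is a short sketch using the widely available Python cryptography package, an assumption made purely for illustration, not the IPU’s actual interface. It shows authenticated encryption for data in flight, with a tenant identifier bound in so a ciphertext replayed under the wrong tenant fails to authenticate.

```python
# pip install cryptography
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def protect(payload: bytes, key: bytes, tenant_id: bytes) -> bytes:
    """Encrypt and authenticate a message leaving the server; the tenant ID
    is bound as associated data rather than encrypted."""
    nonce = os.urandom(12)  # must be unique per message under a given key
    return nonce + AESGCM(key).encrypt(nonce, payload, tenant_id)

def unprotect(blob: bytes, key: bytes, tenant_id: bytes) -> bytes:
    """Decrypt an incoming message; raises InvalidTag if it was tampered
    with or presented under the wrong tenant ID."""
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, tenant_id)

if __name__ == "__main__":
    key = AESGCM.generate_key(bit_length=256)
    wire = protect(b"customer support record", key, tenant_id=b"tenant-a")
    print(unprotect(wire, key, tenant_id=b"tenant-a"))
```

Offloading this kind of per-message work to the NIC, as Thomas describes, keeps host CPU and GPU cycles on the workload itself.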
Will Townsend:
No, I’m glad you touched on security because it’s a huge issue. I mean, not only data security and data sovereignty, but protecting the large language models as well; there’s IP that’s tied to that. And so here’s my gratuitous plug. Intel did commission a paper with my firm, Moor Insights & Strategy, and you can find it on the Moor Insights & Strategy website, but it’s around confidential computing, and it goes into detail, Thomas, on exactly what you were addressing around silicon-level security. And we’re following that up later this year with a confidential AI paper that Intel is also commissioning from my firm. So that is my gratuitous plug to match yours. But hey, my friend, it’s been a great conversation, but I’d like to wind things up with something that’s a little aspirational. Can you share Intel’s vision for AI connectivity, to put kind of a bow on the conversation we’ve had today?
Thomas Scheibe:
Yeah, absolutely. And quite frankly, a lot of fun having this discussion.
Will Townsend:
Likewise.
Thomas Scheibe:
We went a little bit all over, but if I have to bring it down to three points: truly, the market is big, it’s growing, and it will keep growing if there is an open ecosystem. We have seen this in the past in different areas, and it’s the same with Ultra Ethernet. There needs to be an open ecosystem, and that’s where Intel is in, supporting and working with partners. We need standardization both for scale-out, which the Ultra Ethernet Consortium is doing, as well as for scale-up, which is connectivity within a GPU server.
And so we’re working with the industry to make sure standardization happens, because that’s the fastest road to innovation. And the other thing I see is a need for broad partnerships, which I’m very much committed to, Intel is committed to, because again, in the end, this is a win-win-win for a lot of the players. It’s not just connectivity, it’s not just compute or security, as we touched on; it’s how we deal with data. There are so many opportunities in this market. So broad partnerships, standardization, and an open ecosystem are probably the top three for where our vision is in this space.
Will Townsend:
I couldn’t agree more. But with that, I want to thank our viewers for tuning in. If you liked what you saw today, click that like button. And Thomas, thanks again for a great conversation. We could probably keep talking for another hour or two.
Thomas Scheibe:
We’ll take you up on that at some future point. I really appreciate it. Thanks for the time.
Will Townsend:
Thank you.