Building AI Infrastructure for a Global Network

With the launch of Workers AI, Cloudflare has deployed GPUs at scale around the world. In doing so it confronted many challenges that would be familiar to any other enterprise looking to deploy accelerated computing at scale, as well as some unique challenges.

Syona Sarma, Head of Hardware Engineering at Cloudflare, will share the lessons learned in the process. Audience members who are responsible for enabling AI and machine learning in their own organizations will leave with new ideas and tools to support their companies’ AI transformation.

Transcript

Dave Nicholson:
Hello, Dave Nicholson here. Welcome back to Six Five Summit. We’ve got a very exciting guest lined up, Syona Sarma, Senior Director, Hardware Engineering at Cloudflare. It’s important that you remember what I just said, Hardware Engineering. We’re going to have an awesome discussion about infrastructure in the cloud and AI space. Welcome.

Syona Sarma:
Thank you.

Dave Nicholson:
How are you?

Syona Sarma:
I’m good and excited to be here, and look forward to sharing more about the Cloudflare perspective on building out AI infrastructure at-

Dave Nicholson:
Absolutely. Well, tell us more about Cloudflare. For people who aren’t familiar with Cloudflare, a little bit of its history. Fill us in a little bit. Who is Cloudflare? What do you do?

Syona Sarma:
Cloudflare started about 13 years ago as a CDN provider, and as time evolved, started supporting different types of services that included network and security, specifically things like DDoS mitigation, VPNs, bot management, and so on. And finally, we are in a place where we’re able to support a developer platform that we call Workers, in order to build out a developer ecosystem.

As of last year, we’ve ventured into the AI Inference space with a product called Workers AI and a couple of other AI platform solutions. And given that we are an Edge network with a presence across the world, we think we have an inherent advantage with latency, which serves Inference types of applications well.

Dave Nicholson:
Okay, so first of all, CDN, the TLA three-letter acronym, Content Delivery Network, correct?

Syona Sarma:
Yeah.

Dave Nicholson:
So you have a history of having this infrastructure. Give me a little bit more of a sense of the scale of the network that you’ve built out. What does that mean, exactly? What’s your reach?

Syona Sarma:
It means a presence in all of the different regions of the world. And we span everything from multi-colo points of presence to small data centers in the most remote corners. That, we think, is the specialty of Cloudflare in terms of reach. Another thing I should mention is that our architecture is built such that every service, irrespective of what bucket it lands in, runs at the same level of performance and is as secure and reliable as in any other place in the world.

Dave Nicholson:
Okay, so on the subject of services, you mentioned Workers, Workers AI, what do you call it?

Syona Sarma:
Yeah. So we built this AI Inference-as-a-service solution called Workers AI. It’s built on our developer platform, which is called Workers. And what we are attempting to do here is to support a developer ecosystem. It’s more than just an Inference product, though. We are attempting to provide an AI platform solution that includes Inference as a service, but also gives you a place to store and use your models, and to fine-tune pre-trained models. I want to be clear, we are not in the full training space yet, and I don’t think we ever will be, given we’re an Edge solution.
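
To make the developer-facing side of this concrete, here is a minimal sketch of what an Inference call on Workers AI can look like from inside a Worker. It assumes the Workers AI binding is exposed to the Worker as `env.AI` (configured in `wrangler.toml`); the binding shape, model identifier, and prompt are illustrative, not a definitive implementation.

```typescript
// Minimal sketch: serving an Inference request from a Worker via the Workers AI
// binding. Assumes the AI binding is configured in wrangler.toml; the model
// name below is illustrative, not a recommendation.

interface Env {
  AI: { run(model: string, inputs: Record<string, unknown>): Promise<unknown> };
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Run a small chat-style model close to the requester.
    const result = await env.AI.run("@cf/meta/llama-2-7b-chat-int8", {
      messages: [{ role: "user", content: "Summarize what a CDN does." }],
    });
    // Return the raw model output as JSON.
    return Response.json(result);
  },
};
```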

Dave Nicholson:
So, so far we’ve loosely talked about the hardware infrastructure in the sense of the scale of your network. But let’s get down to it. What about GPUs? And is it fair to assume that Inference as a service includes the deployment of GPUs in your network?

Syona Sarma:
Oh, yes. So we started out, like I said, as a CDN provider. So we already had an existing compute and storage infrastructure, and we leveraged that to add GPUs attached to our servers in order to be able to support Inference. I do want to mention that in addition to Inference, we also have an R2 solution, which is our object storage solution, because Inference is really in the context of a larger system, it’s not a workload on its own. And along with a couple of other tools that help you rate-limit and manage your workloads if you’re running training elsewhere and want to run Inference on Cloudflare, we have a platform solution available. And we’re in the midst of rolling out GPUs to be within milliseconds of eyeball latency. So it’s going pretty well so far, and we have a growing list of models that we’re supporting in partnership with Hugging Face and others like that.
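
Because Inference usually sits inside a larger system, a typical pattern is to keep supporting data in R2 and read it at request time. The sketch below assumes hypothetical binding names (`DATA` for an R2 bucket, `AI` for Workers AI) and an invented object key, purely for illustration.

```typescript
// Hypothetical sketch: ground an Inference call in customer data stored in R2.
// Binding names (DATA, AI), the object key, and the model are placeholders.

interface Env {
  DATA: { get(key: string): Promise<{ text(): Promise<string> } | null> };
  AI: { run(model: string, inputs: Record<string, unknown>): Promise<unknown> };
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Fetch customer-specific context stored alongside the Inference service.
    const doc = await env.DATA.get("customer-123/context.txt");
    const context = doc ? await doc.text() : "";

    // Use the retrieved context to ground the model's answer.
    const result = await env.AI.run("@cf/meta/llama-2-7b-chat-int8", {
      messages: [
        { role: "system", content: `Answer using only this context:\n${context}` },
        { role: "user", content: "Which plan is this customer on?" },
      ],
    });
    return Response.json(result);
  },
};
```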

Dave Nicholson:
So you have this network, and now you’re going to do Inference as a service and other things in the AI space. Is it as simple as opening up the servers that are in these various points of presence and sticking a GPU into a PCIe slot, is that all you have to do?

Syona Sarma:
No, it’s much more complicated than that. And particularly from a hardware side, there are a bunch of considerations and challenges that we faced that I would like to talk through, in case it’s helpful to someone who’s either building out their own infrastructure or choosing which solution to run their AI applications on. What are the things that you should be thinking about? Starting out with: what is the type of accelerator you want to provide? It starts by considering the space and the target market segment that you’re looking at. With Cloudflare, we’re focused on Inference, and Inference in the context of the larger system, not training, for a start.

The second thing is to understand what types of workloads, even within Inference, you want to support, because the characteristics of these workloads differ widely, and from a hardware perspective, given the long hardware product life cycles, it’s really important to understand whether you would want a one-size-fits-all solution or a custom-built accelerator solution.

Dave Nicholson:
Yeah. A bit of a theme has developed during Six Five Summit when we’re talking about infrastructure, and cloud, and AI. And that theme is that there isn’t a single tool to do all jobs, and the 1,000-watt or 2,000-watt GPU is not the only way that you can accelerate AI operations. I just want to double-click on your mention of choosing the right Inference accelerator. Can you give me an example of what some of those choices might look like? I imagine that sometimes the considerations have to do with power and efficiency for a given job, but what’s a version of an accelerator that you offer?

Syona Sarma:
A lot of this comes down to, in the case of Cloudflare, what we can support in our infrastructure, like I mentioned. Racks and servers have different system constraints that they operate in. But our goal is to provide the same level of performance across the board, which means we have to do a lot in terms of understanding system design constraints and optimizing to them. Things like power and thermal are top of the list. What is the per-node power at the system level that you can accommodate? What is the rack-level power that you can accommodate? What are the bottlenecks that you’re most likely to see, whether it’s memory capacity, or bandwidth, or network? So having a view of these bottlenecks and monitoring them as they change is really important as we build out a roadmap.

Now, I want to just focus on the roadmap, because a single solution is not going to meet all your needs at the pace that things are changing. So we have more of a phased approach where we deploy a certain type of accelerator. Right now, we’ve been focused mostly on LLMs, because that’s what our customers would like to see, but it’s smaller to mid-size models with higher throughput and latency requirements right now. In the future, we will want to support something that’s more of a fine-tuning, more custom, build-your-own type of use case, at which point we might want to transition to a different type of accelerator. So it’s really becoming a multi-pronged approach where you have different types of solutions, and our attempt is to make that invisible to the customer in terms of the KPIs that they see.
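
As a back-of-envelope illustration of the power and rack-level constraints mentioned above, the sketch below checks how many accelerator-equipped servers fit within a rack power budget. Every number is a hypothetical placeholder, not a Cloudflare figure.

```typescript
// Rough power-budget sanity check. All figures are invented placeholders.
const rackPowerBudgetW = 12_000;  // assumed usable power per rack
const serverBaseW = 450;          // assumed server draw before accelerators
const acceleratorW = 300;         // assumed per-accelerator board power
const acceleratorsPerServer = 2;

const serverW = serverBaseW + acceleratorsPerServer * acceleratorW;
const serversPerRack = Math.floor(rackPowerBudgetW / serverW);

console.log(`Per-server draw: ${serverW} W`);
console.log(`Servers per rack within budget: ${serversPerRack}`);
```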

Dave Nicholson:
Interesting point on making it invisible to the customer, because I would argue that we’re coming out of an age where the mantra was, “Cloud first, I don’t care about the infrastructure. Infrastructure doesn’t matter. In fact, we’re running serverless applications,” which some people started to believe didn’t include servers on the backend. So, I think the point that we’re both making here is that it’s really important to pay attention to the infrastructure and to deploy it correctly.

Now, the customer doesn’t have to, because Cloudflare or some infrastructure person will on the backend. Can you give me an example of how this works in the real world, maybe customer examples, or at least industry use cases for Inference as a service and the new services you’re delivering?

Syona Sarma:
Yeah, so we have a couple of different buckets of Inference workloads that we started out characterizing. The first was Inference, the second, LLMs, and the third, Recommenders. There was not as much interest in the Recommendation system workload, primarily because it’s on the edge and needs a whole lot of memory capacity, which our systems are not built to provide. So we decided to focus in on the other two buckets. And in terms of the exact requirements, they vary from highly memory-bound workloads to highly compute-bound workloads. So we started out by saying we would like to support smaller models, say 7 billion parameters, with the throughput requirements that would be satisfying to customers, but also have the capability in our systems to support larger Inference models, say up to 50 billion parameters, and still provide the latency benefit at the edge from a customer standpoint.
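
For a rough sense of why the 7-billion versus roughly 50-billion parameter split matters on memory-constrained systems, the sketch below estimates weights-only memory at different precisions. It ignores KV cache, activations, and runtime overhead, so real requirements are higher.

```typescript
// Weights-only memory estimate: parameters x bytes per parameter.
function weightMemoryGB(params: number, bytesPerParam: number): number {
  return (params * bytesPerParam) / 1e9;
}

console.log(`7B  @ fp16: ~${weightMemoryGB(7e9, 2).toFixed(0)} GB`);   // ~14 GB
console.log(`7B  @ int8: ~${weightMemoryGB(7e9, 1).toFixed(0)} GB`);   // ~7 GB
console.log(`50B @ fp16: ~${weightMemoryGB(50e9, 2).toFixed(0)} GB`);  // ~100 GB
console.log(`50B @ int8: ~${weightMemoryGB(50e9, 1).toFixed(0)} GB`);  // ~50 GB
```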

Dave Nicholson:
Well, very, very exciting times for the world of AI and Cloudflare. Syona, is there something that we missed here? What else should people understand about the latest and greatest from Cloudflare?

Syona Sarma:
We are looking at an AI platform solution that includes Workers AI, which is serving Inference, but we also have two other products that I want to mention. One is the AI Gateway product, and this is meant for customers who are not running Training on Cloudflare and would like to move to Inference on Cloudflare. We have tools that allow you to cache, rate-limit, and run data analytics, and give you a seamless way to transition to Cloudflare to be able to run Inference.
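
The usual pattern with a gateway of this kind is to route provider calls through a gateway endpoint so they can be cached, rate-limited, and logged. The sketch below follows that pattern with an OpenAI-style request; the account ID, gateway ID, API key, and model name are placeholders, and the exact URL scheme should be checked against Cloudflare’s AI Gateway documentation.

```typescript
// Sketch: proxy an OpenAI-style chat completion through an AI Gateway endpoint
// so the request can be cached, rate-limited, and logged. IDs are placeholders.

const ACCOUNT_ID = "your-account-id";
const GATEWAY_ID = "your-gateway";
const PROVIDER_API_KEY = "provider-api-key"; // placeholder credential

async function askViaGateway(prompt: string): Promise<unknown> {
  const response = await fetch(
    `https://gateway.ai.cloudflare.com/v1/${ACCOUNT_ID}/${GATEWAY_ID}/openai/chat/completions`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${PROVIDER_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-mini", // placeholder model name
        messages: [{ role: "user", content: prompt }],
      }),
    },
  );
  return response.json();
}

askViaGateway("Hello from behind the gateway").then(console.log);
```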

The other product that I want to mention is R2, which is an object store that will help you if you’re trying to do any sort of retraining or fine-tuning, which is a customer use case that is becoming dominant, with data sets that are more customized to the use case that you’re looking at. So, I want to emphasize, we’re trying to become the platform for Edge Inference, not necessarily just the place to be able to run Inference.

Dave Nicholson:
Fantastic. Syona, thank you so much for spending time with us here at Six Five Summit. Lots of exciting stuff in the infrastructure space coming from Cloudflare. Stay tuned for more from Six Five Summit, coming right up.
