Building Enterprise AI Platform using Hypercomputer

Google Cloud believes in offering flexibility and choice of products so customers can build the AI platform that suits their unique needs. We’ll share how Google Cloud empowers customers to build their AI platform to attain better performance/$ and shorter time to market. Tune in to hear how Google Cloud’s Hypercomputer enables you to build innovative AI applications without having to worry about underlying infrastructure, portability, compatibility, reliability and scalability issues.

Transcript

Dave Nicholson:
Welcome back to Six Five Summit. The conversation about AI continues. I'm Dave Nicholson, and I've got a very special guest from Google. Mr. Maulin Patel is Director of Product Management for GKE, Google Kubernetes. And what does the E stand for, Maulin?

Maulin Patel:
Engine.

Dave Nicholson:
Engine. GKE. We’re going to be talking a lot about GKE. Welcome Maulin.

Maulin Patel:
Thank you Dave, and glad to be here. A little more introduction on my side: I've been with Google for six years, leading the GKE team and owning very exciting areas, including our AI platform, our stateful workload business, and also our gaming business. Before coming to Google, I was a general manager at GE, and before that I was a principal data scientist. I'm super excited to talk all about AI today. So take it away, Dave.

Dave Nicholson:
Fantastic, fantastic. Yeah, it’s always good to know. I had a mentor tell me once, “Always let people know a little bit about yourself beforehand, so they know why they should listen to you.” Maulin is someone who we should listen to. He has a lot of practical experience, which always makes conversations interesting. There are a variety of ways that people can build and deploy AI solutions in the Google Cloud platform, GCP for short. You are on the sort of GKE side of things, but what are some of the things people need to consider when they go in looking at whether or not they’re going to deploy via the sort of self-managed GKE route versus a more managed environment, a more managed platform like Vertex? Tell me about that.

Maulin Patel:
Fantastic question. So in order to build, let's say, an AI application, a customer has to deal with a lot of complexity. For example, you start by gathering the data, cleaning the data, and extracting the features, then you have to build a model, then you have to provision resources to train the model. And after the model is fully trained, you have to put it in production with live traffic and monitor its performance. This requires a very sophisticated platform that allows you to do this at scale and makes the entire process simple, easy, repeatable, reproducible, and auditable.

So there are many ways in which you can do this, but from the platform point of view, I would broadly classify two main approaches. Let's say you are a customer and you are looking for a one-stop-shop, comprehensive, end-to-end MLOps platform, one that is feature rich, gives you the latest and greatest, cuts down your learning curve, and makes you productive almost instantly. That would be the platform we call Vertex AI. The beauty of Vertex AI is that it's a fully managed platform, which means as a customer you don't have to worry about provisioning, managing, and maintaining the underlying resources.

All you have to focus on is: how do I get my data? How do I build my model? And in some cases you can use open source models and go live very, very fast. So that will be the route for customers who are looking for convenience. However, we also have lots of customers who want to train and serve models across multiple platforms. They may be doing it in multiple clouds or even on-prem. So what they are looking for is portability: I want to write my code once and run it everywhere. Often customers are also looking for flexibility and full control over their entire stack. They have an affinity for open source, and some organizations are also very wary of vendor lock-in. So for all these reasons, a lot of our customers actually choose to build their AI platform on Google Kubernetes Engine. Customers like Anthropic, Character AI, Runway, Shopify, Spotify, Snap, you name it: lots of those customers fit in the second category, where they choose to build their own platform on Google Kubernetes Engine, and this gives them full freedom.

Another dimension to this is that customers have often built their web serving platform on Google Kubernetes Engine and are now venturing into AI. So it is natural for them to leverage their existing investments and build a platform that gives them consistent operations across DevOps and MLOps, so they can use the same workflows and the same CI/CD pipeline for their MLOps that they have been using for DevOps. So that's another reason why customers build their self-managed, full-control platform on Google Kubernetes Engine.

Dave Nicholson:
So it's interesting, when you described Vertex AI, that sounded like a pretty good value proposition and I thought we could sort of end the conversation there. But the reality is there are still a lot of people who want to leverage infrastructure as a service, in terms of both hardware and software, because they want to have that control. You mentioned portability and flexibility. How much of this, for GKE, do you think is customers' desire to control things and create value in their own solution, versus just the psychological fear of giving up control? Is it rational to build it yourself? I know Google is agnostic on this, but you clearly offer both approaches, managed and customer managed, for a reason. Any thoughts on that? What's driving this continued demand for GKE?

Maulin Patel:
Yes, great question. So there are legitimate reasons why customers would choose to build a platform on their own. We all know it's hard, and you need quite a sophisticated, talented machine learning platform team to build your own platform. But in some cases that investment is well worth it, and that's what we hear from our customers. The reasons I outlined are portability, flexibility, control, open source affinity, and also avoiding vendor lock-in. And these are legitimate concerns for some of the large customers who may have a footprint in multiple places, including Google Cloud, other cloud vendors, as well as on-prem.

So for those reasons, it does make sense for those customers to take that extra effort to build their own platform. Sometimes it also helps to be in charge of your own destiny. If you are doing cutting-edge innovation and you want to be in full control of your entire stack, you want the freedom to implement the latest and greatest models, tools, or open source frameworks as soon as they come out, then you want to be in control of your platform and the entire infrastructure. Sometimes it's also driven by the desire to optimize performance. A lot of customers who spend a lot of money on GPUs and TPUs like to optimize their network performance all the way down to the operating system, security, and a lot of other things. So that's another reason they continue to invest in infrastructure as a service.

Dave Nicholson:
And again, you mentioned portability and flexibility, and that's great. Kubernetes, fantastic: easy to move things around, easy to move between clouds, between on-prem and off-prem. But why GKE? You mentioned optimizing performance. What about cost containment? What are you doing? Look, I can find GPUs and Kubernetes at every corner 7-Eleven these days. But seriously, what are you doing that's special with GKE that people should know about?

Maulin Patel:
Fantastic question. So let me start with the main pain points and customer requirements when you're building an AI platform. If you look at it today, I would highlight a few things that are must-haves and top priorities for most of our customers, whether you are on GKE or Vertex; it applies to both. The pace of innovation in AI is astronomical, right?

Every day there are new technologies coming up. So if you are in the business of AI, time to market is very, very important for you. You want to be first as much as possible; you want to launch new features and new technologies as fast as possible. Second, the cost of GPUs and TPUs is astronomical, so you want to minimize the cost of training and serving. And third is obtainability: getting GPUs and TPUs is still very challenging. We are in a scarce resource environment, and finding GPUs and TPUs is still a challenge if you want to do things at scale.

I understand you can get them at the corner if you're only looking for a few, but if you're looking for large numbers, then you need to optimize your platform so you maximize the availability and obtainability of your GPUs and TPUs. So those are the primary concerns for customers. So in the GKE team-

Dave Nicholson:
And by the way, just to interject, you mentioned GPUs and TPUs. Google is going to leverage whatever the best fit for function is from a hardware perspective, right?

Maulin Patel:
Correct, yes. And the customer has full choice when they use Google Kubernetes Engine, right? You could use GPUs, you could use TPUs, you could use CPUs. We give you full freedom to pick what's best for you. And within each category there are multiple variants. NVIDIA provides many different variants of GPUs, whether it's L4, H100, or T4. For TPUs, we have various generations, and we offer you full control and choice of what works best for you. Now, going back to the innovations that we're bringing to market.
So our primary focus is to help customers get the best performance per dollar, and that means a couple of things. If you are doing, let's say, inference on GPUs and TPUs, then you want to make sure that your expensive GPUs and TPUs are fully utilized. And what we have found in practice, looking at our own fleet and our own internal users at Google, is that utilization of GPUs and TPUs is very, very low. And what I mean by utilization is duty cycle. So why is that happening? The reason utilization is often low is that in order to run an online inference service, you have to account for the dynamic nature of your workload.

The workload can spike when there are many users of your service, and it may go down overnight or when users are busy doing something else. So how do you account for that? Kubernetes is naturally designed to autoscale, which means when the load spikes, it dynamically scales up the cluster, and when the load shrinks, it scales down the cluster, so you save money. However, in the world of GPUs and TPUs, how fast you can spin up a new node and how fast you can start a new workload have been very, very slow. In our own experience, when we were running a couple of hundred-billion-parameter models, downloading them onto a new node used to take hours. And if your autoscaling is that slow, then you compensate by having significant overprovisioned capacity to be able to handle load spikes, and that increases your cost and the utilization goes down, because the majority of the time those GPUs and TPUs are idle.
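To make the autoscaling pattern concrete, here is a minimal sketch of a GPU inference Deployment paired with a HorizontalPodAutoscaler on GKE. The names, image, GPU type, and thresholds are hypothetical placeholders; real serving stacks often scale on custom or GPU metrics rather than CPU, and GKE's cluster autoscaler or node auto-provisioning adds and removes GPU nodes as replicas come and go.

```yaml
# Hypothetical GPU inference Deployment on GKE; names, image, and GPU type are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # pick the GPU variant that fits the model
      containers:
      - name: server
        image: us-docker.pkg.dev/my-project/serving/llm-server:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1      # one GPU per replica
---
# Scale replicas with demand so idle GPU nodes can be released when traffic drops.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Resource
    resource:
      name: cpu                    # production setups often use custom or GPU utilization metrics
      target:
        type: Utilization
        averageUtilization: 60
```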

There are other reasons why they are idle: sometimes they're waiting for data, sometimes they're not being used because they're under maintenance, or there could be failure recovery scenarios as well. So in the Google Kubernetes Engine team, we have brought significant innovative technologies to market, and I would like to mention a couple of them. We have something that we launched at the last Next, in April, where we allow customers to preload container images and model weights on a secondary boot disk. By doing so, the startup time of the workload comes down from hours to minutes for very large models, and for smaller models it can come down from minutes to seconds. Our Vertex platform actually runs on GKE, and they implemented this secondary boot disk based fast workload startup technology; for a 16 GB model, they found that they could speed up workload startup time by 29x, which is astonishing.

So this is how you get better utilization. That's one example of innovation. We are doing similar things for training. In training, the challenge is that you have very expensive GPUs and TPUs sitting idle because you can't pump the data in fast enough. So we launched Cloud Storage FUSE with local caching. What it does is download the data onto local SSD and expose it with file semantics. So you can take any existing model, maybe you built it on your own or you got it from the internet; typically, those models read files, so those files can be read directly from local SSD instead of going over the network to Google Cloud Storage, which means the I/O is really, really fast and you're keeping your expensive GPUs and TPUs busy.

Dave Nicholson:
So in this context, when you say local, you mean local to the pods and clusters in the GKE context, not local as in on-premises?

Maulin Patel:
Correct. By local I mean the physical node; these local SSDs are attached to the nodes. So when the pod boots up, the data is locally available, which means I/O, reading and writing, is really, really fast, which means you're keeping your GPUs and TPUs for training really, really busy instead of waiting for data to be loaded from a network drive.
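As a rough illustration of the pattern Maulin describes, here is a minimal Pod sketch using the GKE Cloud Storage FUSE CSI driver. The bucket, image, service account, and accelerator names are hypothetical placeholders; the pod's service account needs Workload Identity access to the bucket, and local file-cache behavior is tuned through mount options whose exact form depends on the driver version.

```yaml
# Hypothetical training Pod that reads a GCS bucket through the GKE Cloud Storage FUSE
# CSI driver, so the training code keeps using ordinary file reads.
apiVersion: v1
kind: Pod
metadata:
  name: trainer
  annotations:
    gke-gcsfuse/volumes: "true"            # asks GKE to inject the GCS FUSE sidecar
spec:
  serviceAccountName: training-sa          # placeholder; needs Workload Identity access to the bucket
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-h100-80gb   # placeholder accelerator
  containers:
  - name: trainer
    image: us-docker.pkg.dev/my-project/train/trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 8
    volumeMounts:
    - name: training-data
      mountPath: /data                     # dataset appears here as ordinary files
      readOnly: true
  volumes:
  - name: training-data
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-training-bucket     # placeholder bucket
        mountOptions: "implicit-dirs"      # file-cache options can be added here (version dependent)
```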

Dave Nicholson:
Excellent. Very interesting. Very interesting. In our closing moments together, can you give us, if we were to meet at a restaurant and I had just a couple of minutes with you to say, hey, give me some words of wisdom on common things that maybe are misunderstandings when thinking about deploying AI or some hot tips that you might have, what might those be?

Maulin Patel:
Yep, great question. As we discussed, typically our customers are really motivated to cut down time to market, save on cost, and get the best performance. These are all the right metrics. However, in the zeal to go really, really fast, some other factors that are super important often get overlooked. For example, security, reliability, and day-two operations often take a backseat, and that creates problems in the long run. So what we would recommend is that when you design your platform, you make sure these very important elements, security, reliability, and day-two operations, are not overlooked. Because you will be spending a lot of money building foundation models or any other models, and that is your intellectual property; you don't want it to be stolen or leaked. When you go live with your service, you want to make sure it runs with the desired uptime, so your customers and end users get the best experience when you run the service at scale. You also want to minimize the toil on your operations team.

So day-two operations become a really, really important element. And if you think about the explosion in GPU driver versions, CUDA versions, Kubernetes versions, NCCL versions, and operating systems, and all the permutations and combinations, managing and maintaining them and keeping them secure can be really, really tiresome. So you want to design a platform that takes care of all these challenges from the get-go, instead of them being an afterthought.

Dave Nicholson:
Sage advice from Maulin Patel. Where should people go to learn more about Google Kubernetes Engine and the AI platform offerings from Google?

Maulin Patel:
Yeah, so we have very rich documentation and websites full of self-help guides, best practice guidance, and videos. Go to the Google Cloud website and look for any particular topic that you are interested in, whether it's the AI platform, Vertex, or the things that I mentioned, like GCS FUSE with local caching, or some of the new technologies we have launched, like the secondary boot disk based fast workload startup.

We also have some cool new work done on Kubernetes, which we call Kueue, which allows you to dynamically share your GPUs and TPUs between training and serving, and also among different teams. Every team gets their fair slice of GPUs and TPUs, your high priority workloads get priority, and lower priority workloads can be preempted dynamically, so you get the most utilization out of your expensive resources. All of this and many more features are well documented and available on our website, so please go check it out. And we are always here to help, so you can directly reach out to any of the Google Kubernetes Engine team members or Google Cloud specialists and your account teams, and we are eager and willing to help.
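For readers who want to look this up, the open source project is Kueue. Here is a minimal sketch of the sharing pattern Maulin describes, assuming Kueue is installed on the cluster; the queue names, namespace, quotas, and image are hypothetical placeholders, and priority-based preemption is configured on the ClusterQueue beyond what is shown here.

```yaml
# Minimal Kueue sketch (hypothetical names and quotas): a shared ClusterQueue hands out
# GPU quota to a team's LocalQueue, and Jobs wait in the queue until their requests fit.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}              # accept workloads from any namespace with a LocalQueue
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 512Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 16             # this queue's fair share of GPUs
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: gpu-cluster-queue
---
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-job
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue   # Kueue admits the Job when quota is free
spec:
  suspend: true                      # created suspended; Kueue unsuspends it on admission
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: us-docker.pkg.dev/my-project/train/finetune:latest   # placeholder image
        resources:
          requests:
            cpu: "8"
            memory: 64Gi
            nvidia.com/gpu: 8
          limits:
            nvidia.com/gpu: 8
```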

Dave Nicholson:
Fantastic, Maulin. Maulin Patel, thank you so much for joining us. And for the rest of you out there on the inner webs, stay tuned for more exciting coverage here at Six Five Summit.
