Groq Cloud is Changing the Rules of the Game in Generative AI
In the AI age, there will be an unchanging need for compute at the fastest speed and lowest cost. Groq GM of Cloud Sunny Madra explores this topic with Six Five Media Host and The Futurum Group Chief Research Officer Dave Nicholson.
Transcript
Dave Nicholson:
Welcome back to the Six Five Summit. I’m Dave Nicholson and I’ve got a special guest, Sunny Madra, general manager of GroqCloud. Sunny, welcome. How are you?
Sunny Madra:
Awesome. Great to be here. Thanks for having me.
Dave Nicholson:
Tell us a little about Groq, for the uninitiated, for folks who… I can’t imagine who doesn’t know about Groq, but just give us a little bit about Groq and where Groq came from, what Groq does.
Sunny Madra:
Yeah, really actually, great background. Groq is a company that was started by Jonathan Ross, who was the founder and adventure of the TPU at Google. He left Google to create a similar AI chip for the rest of the world. And what Groq is today is a fully vertically integrated stack that offers tokens in the world of generative AI to developers and enterprises across the world. And we’ve really been able to enamor people by the following things are the speed at which we generate tokens, the low latency which we do it at, and then the cost at we can do it at. So that’s been so exciting with Groq this year.
Dave Nicholson:
So that full stack includes hardware acceleration, as well as full stack that developers can take advantage of. If it’s not a TPU, what do you refer to the Groq hardware accelerator apps?
Sunny Madra:
Yep. We call our chip the LPU, the language processing unit. And what’s interesting is obviously that makes sense to a lot of people for LLMs, but what people don’t always appreciate is that the LPU is actually able to do all different types of models. So it’s not limited to just large language models. And that’s really exciting.
Dave Nicholson:
If I understand correctly, it’s all about the way the math is being done on the device and all those-
Sunny Madra:
It really is. All AI is just math underneath and so it’s a chip that’s very, very good at doing math quickly.
Dave Nicholson:
So you’ve come to Groq to build the Groq cloud where people can access LPUs as a service. Is that fair to say? Tell us about GroqCloud, what you’re doing now.
Sunny Madra:
Yeah, and maybe I’ll give you a little bit of my background because it’ll help folks that are listening to this here which is…
Dave Nicholson:
Yeah. Absolutely.
Sunny Madra:
So I’ve been a serial entrepreneur and I’ve built and sold a couple companies, and I was working on my third company when I connected with Jonathan, who I’ve known since he had left Google. And really what I saw was the company had this incredible technology, which was built for all different types of use cases, but specifically for the cloud, which they had talked and thought about before. It really had an opportunity to break out. And so our skills in basically building cloud companies before combined with the great technology that Groq had built brought us together to bring GroqCloud. And so our group runs GroqCloud. You can try it out at console.Groq.com and it’s a serverless API for AI inference. So what you’re able to do is you’re able to go there, log in, get an API key, and you can basically start doing inference right away on an API. That API we have our own SDK for. It’s also OpenAI compatible. So, it’s really easy for developers that were maybe using OpenAI and want better performance or lower cost to use our inference engine to generate tokens with.
Dave Nicholson:
So talk about the growth of GroqCloud, but more importantly the growth of the developer community because that’s been pretty staggering in terms of the ramp up that you guys have experienced.
Sunny Madra:
It has been, and look, the growth of GroqCloud and the growth of developer community is one and the same in that we did a soft launch in the middle of February. In that soft launch, we went viral. And the reason we went viral is people were amazed by this great technology that was finally available for them to use on their own in a self-service capacity. Since then, we’ve had 225,000 developers sign up. So just in a short amount of time, 12 weeks. Why they’re really coming to the platform is developers are hungry for low cost tokens, and that’s because they want to make this new class of applications, but they want the tokens to be low cost and they want them to be low latency because they want applications that have that same look and feel of the best applications either on your phone or on the internet.
And if you think about inference today outside of Groq, you go to any chat site, it sort of streams at you like it’s a dial-up connection versus if you go to Google and do a web search, it’s instantaneous. And a lot of work happens there to generate those results quickly or an application. So what we really brought to developers was a low cost experience, which is always important for them, but also a low latency and high throughput experience so that they can create those same type of user interfaces and interactions for their users that they’re used to experiencing in this traditional web applications.
Dave Nicholson:
And you alluded to this idea of speed contributing to low latency. Can we double click on that a little bit? Because I think it’s important for people to understand that there are significant architectural differences between what you do with your hardware acceleration, and then out of that, the stack that developers are interacting with. There’s a lot of misconception in the marketplace I would say that NVIDIA in particular has locked up the developer community with CUDA stuff. Can you speak to that? An educated developer who’s saying, “I want the most bang for my buck, the lowest latency I can get, and I want to participate in this development community.” Why are they so drawn to Groq?
Sunny Madra:
Yeah, that’s a really good question. So firstly, we should give credit to NVIDIA creating an incredible business and CUDA, which is incredible technology. The real difference that’s emerging now is CUDA is not for the everyday developer. CUDA is for the developer that’s making large language models and they want to basically create performance gains with them by leveraging CUDA kernel specifics. What’s really happening now as generative AI continues to mature, developers are not making models, developers are consuming models. Those models get hidden behind an API, and so they’re not interacting at all with the specifics of whether it’s NVIDIA, Groq or AMD that’s running those models because it’s been abstracted. So think about it in the cloud, when you go to AWS and you ask for a machine, you’re not really caring about whether that machine is run on an Intel CPU or an AMD CPU, you’re looking for a certain compute unit to complete your task more. More so if you go to Amazon and you’re using S3, you don’t really know what type of hard disks are sitting underneath that storing your information.
So we’re seeing the same thing happen in generative AI where developers know the model they want, but they’re less concerned with what the underlying infrastructure is. And so by taking all the infrastructure that Groq had been building for seven years and abstracting it behind a simple API, it’s really allowed developers to experience the power of Groq in the same way that they were experiencing it with NVIDIA based or a AMD based solutions on other clouds. And so that’s why the communities been so excited. Now, why is speed as in throughput or latency as in timing so important? Goes back to our previous, every a hundred milliseconds that you shave off an internet search leads to hundreds of millions of dollars of extra revenue. And in the world that we live in today, when we’re used to such short attention spans, we’re used to being able to quickly go through content. We are not on dial-up internet anymore. We have high speed at home. We get the latest 5G phones. We do all of that in the pursuit of speed and so we basically bring that to generative AI with our technologies.
Dave Nicholson:
Would you say that you exclusively offer a value proposition for inference, or is it a mix of inference and training? How do you come down on that? And then under the heading of training, of course there is the foreseen million GPU model that’s going to be trained as opposed to no, no, no, no, I have my own data in my company that I want to train my model on. So how does GroqCloud fit into that model? What’s the best use case?
Sunny Madra:
Yeah, look, what I’ll say is when it comes to training, NVIDIA is best in class and still is, and they’ve done a lot of work. And honestly, their developer ecosystem, specifically the ones around, the ones that are making models, they’ve done a great job to support those folks. And so we’ve focused on inference and we focus on inference because we understand that that part of the market will be much larger than training. So if you think about it, a model will get trained once maybe over a course of several months, but get used several hundreds of millions if not billions of times by people. And so we focused on what we believe will ultimately become the bigger part of the market.
And so that’s the focus aspect. In terms of your second question, which was around the developer speed… Around, sorry, data and training. When it’s doing inference, we don’t really focus on taking any data, and that makes it easier for developers to work with us because we’re not really taking data and using it for training and potentially consuming it. In fact, we don’t store any data on our machines. Anything that comes through and from an inference perspective is completely passed through. And we’ve done that on purpose because we really want to give folks the comfort and understanding that we are not participating in any type of data capture exercises or data retention exercises.
Dave Nicholson:
Got it. So yeah, it sort of puts an exclamation point on the difference between inference and training because when you say performance, when you say speed, training, it could be how long will it take to train this model. Inference, it’s in how many milliseconds can I get a response?
Sunny Madra:
That’s exactly it.
Dave Nicholson:
You gave a Google example. It’s a classic example. We’ve all been using that predictive text. If it can’t predict the next word before you can figure it out, it’s useless. So we may be okay with the sort of experimental phase we’re in right now as consumers playing around with generative AI and waiting for four or five seconds or 10 seconds or 30 seconds, it almost seems kind of quaint. It’s kind of fun. But in the real world, you’ve got to have the kind of speed that you deliver.
Sunny Madra:
Well, that’s exactly it, and you nailed it. And I think that experience is what’s starting to emerge with folks that are making applications. And what also happening is when you think of agentic use cases, so like agent-based AI that are doing multiple tasks for you, not a single shot question and answer, those use cases typically can take several minutes because there’s a lot of back and forth that the AI is doing with whatever service it’s interacting with. So think about asking an AI agent to book a ticket for you. It’s going to have several interactions back and forth. So if those things can get sped up, it can be the difference from a use case taking four or five minutes going down to 10 seconds.
Dave Nicholson:
So not to be cynical, but basically what you’ve done is you’ve said, “Hey, you know what a very specific task with specific requirements, let’s build something that does that thing really, really well.”
Sunny Madra:
That’s it.
Dave Nicholson:
What a novel concept. But back to the hardware acceleration part of it, I understand that there are unique aspects to the decisions you’ve made about how you build these things and what your supply chain looks like. Is that relevant from your perspective in the cloud conversation or is that something people should look up online?
Sunny Madra:
No, it is, and we can touch on it quickly. I think the following aspects, I think one, we’ve built a chip that does not use any external memory. So there’s no HBM. So from a supply chain standpoint, HBM is sold out well into a year in advance. And so if you were trying to build something and you didn’t already have access to supply, it’d be very hard for you. Second, are actually chips are a 14 nanometer process, which is four generations old. And I think because of that, one, it shows the incredible work that our engineers did to create such a powerful chip that’s competitive with something chips that are four generations newer, but more importantly, it really unlocks supply chain for us.
The fab capacity of 14 nanometer is easily available compared to the fab capacity of the latest technologies. And then lastly, which something we’re really proud about, our chips are fabbed in North America, in fact in upstate New York. And so being able to do that, we really don’t have to really get at risk of any kind of supply chain constraints that can happen over shipping or conflicts or anything like that that can happen. So that really gives us the ability to scale our cloud really, really fast compared to say anybody else that would try to do it with some of the supply chain constraints.
Dave Nicholson:
Yes, some would say, “Oh, Dave, those things don’t matter. It’s a service. You shouldn’t care what’s happening behind the scenes.” But it’s so important. I wanted to ask you about that. Well, Sunny, finally, what do you think is coming down the line in terms of the future of generative AI in particular? Anything that excites you specifically personally or professionally?
Sunny Madra:
Yeah. No, there are, and there’s the following things, which I’m very excited by. I think first is multimodal AI. And we started to see that with GPT-4.0 and seeing that in the open source community, which I’m pretty confident we’ll see that in the next 90 days is going to be really powerful. So that’s where multimodal means you can interact with the same model using voice, images or text, and I think that’s going to be incredibly powerful. The second thing that I’m super excited by is these agentic use cases. And why I am excited by that is today we’re living in the era where we’re taking human tasks and replacing them with a single AI driven task. And what I really think about is sort of what happened in the industrial revolution. We went from say bespoke car making one car at a time to factories that made cars.
And now what we can do is we can basically say, instead of replacing a single task with an agent, why don’t I have a hundred agents look at my trip options and then have another agent look at the results they come up with, because that’s way more powerful. So we’re kind of going to the industrial revolution of technology. We haven’t seen that before. So that’s super exciting. And then lastly, I think just sort of the advancements by the open source community. There’s been so much rallying by different companies, definitely led by Meta, but a lot of different companies to make the open source more available and grow it quickly. And just seeing, and look, I’ve been a big benefactor of open source through my career in startups so seeing the companies rally around open source and so they’re not being a winner take all in a single company, that’s really exciting. Those are the things that I really think about a lot.
Dave Nicholson:
Very interesting. Thanks for those thoughts. Sunny Madre, general manager of GroqCloud, thanks so much for being with us here at Six Five Summit. For the rest of you, stay tuned. Much more to come.