The Science behind AWS GenAI

How is AWS making generative AI more accessible and cost-effective? Jason Andersen, Vice President and Principal Analyst, Moor Insights & Strategy, is joined by Amazon Web Services’ Sherry Marcus, Director, Bedrock Science, for a look at the science powering AWS’s GenAI capabilities on this episode of Six Five at AWS re:Invent.

Tune in for details on ⤵️

  • AWS’s internal best practices for building and deploying GenAI capabilities for their customers
  • The role of data science in providing a scalable, secure, and resilient three-layer stack for GenAI at AWS
  • The strategic focus on multi-agent collaboration and model distillation within Amazon Bedrock to facilitate the creation of high-accuracy, low-latency AI models at reduced costs
  • Insights into AWS’s process for starting GenAI projects, selecting models for distillation, and the importance of autonomous agents in the future of GenAI
  • An overview of how model routing enhances AWS’s GenAI offerings, ensuring efficiency and effectiveness in the deployment of AI solutions

Learn more at Amazon Web Services.

Watch all of our coverage at Six Five Media at AWS re:Invent, and be sure to subscribe to our YouTube channel, so you never miss an episode.

Transcript

Jason Andersen: Hello and welcome to this episode of Six Five On The Road. I’m Jason Andersen, Vice President and Principal Analyst – Applications, Application Platforms and DevOps, and today we’re at AWS re:Invent 2024 and we’re talking about Bedrock with Sherry Marcus. Hi, Sherry.

Sherry Marcus: Hi. Thank you so much. It’s great to be here today.

Jason Andersen: Oh, great. Thank you. Okay, well let’s just start with a primer for people.

Sherry Marcus: Yes.

Jason Andersen: So what is Bedrock? And it was a bit of a visionary platform when it came out about a year or so ago, so maybe what’s your take on the market today?

Sherry Marcus: Sure. So Amazon Bedrock is AWS’ generative AI service where customers can choose from one of many models and create a generative AI application fit for purpose. So for example, they can pick any of the Llama models or the Anthropic models and then add different functionalities: RAG, retrieval augmented generation, to enrich the models with database information; guardrails to ensure different types of privacy and PII protection; as well as agents. And all of these functions are wrapped into a very easy-to-use application on Bedrock. My role in Bedrock is that I lead science, and AWS Bedrock is very much a science-led product.

And so we are developing all of the great components within Bedrock, such as guardrails, agents, distillation, and so forth. What I would say is that the biggest changes that have happened in the last year come down to a few things. One, customers truly do want model choice. They’re no longer using a single model, and they’re very concerned about three things: cost, accuracy, and latency. And so when we build these applications for customers, we’re trying to hit a golden triangle for them, to optimize on those three points.
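
As a point of reference, here is a minimal sketch of what calling a foundation model through Amazon Bedrock looks like with the AWS SDK for Python (boto3) and the Converse API. The model ID and region are illustrative placeholders, and capabilities such as Knowledge Bases (RAG), Guardrails, and Agents are configured through separate Bedrock features rather than this single call.

```python
# Minimal sketch: invoking a foundation model through Amazon Bedrock with the
# boto3 Converse API. The region and model ID are illustrative; any model
# enabled in your account (Llama, Anthropic Claude, Nova, ...) can be used.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
    messages=[
        {"role": "user", "content": [{"text": "Summarize the outlook for solar panel prices."}]}
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```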

Jason Andersen: Okay, well that’s an important point. So let’s talk about the newest releases.

Sherry Marcus: Yes.

Jason Andersen: In this particular set of releases, you have a couple of really powerful capabilities in terms of multi-agent collaboration?

Sherry Marcus: Yes.

Jason Andersen: And something called Model Distillation, right?

Sherry Marcus: Yes.

Jason Andersen: Can you talk a little bit about both of those features?

Sherry Marcus: Yes. Let me start with Model Distillation. So what is Model Distillation? Model Distillation is taking a large model, such as a Llama 405B, and using it as a teacher to train a smaller model, such as a Llama 8B model, for specific tasks. And why do customers want that? Well, it is much more cost-effective to run a smaller model given the same accuracy. So that’s why Distillation is an important thing. And going back to the three elements: less cost, and the same accuracy. Okay, so that’s Distillation.
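
To make the teacher-and-student idea concrete, here is an illustrative sketch of the first step of distillation: using a large teacher model to label task prompts, which become fine-tuning data for a smaller student model. This is a conceptual sketch, not Bedrock’s managed Model Distillation feature; the model ID, prompts, and output file are hypothetical stand-ins.

```python
# Illustrative sketch of the teacher step in model distillation: a large
# "teacher" model answers task-specific prompts, and its answers are collected
# as fine-tuning data for a smaller "student" model.
# Not the managed Bedrock Model Distillation API; the model ID is an example.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
TEACHER_MODEL_ID = "meta.llama3-1-405b-instruct-v1:0"  # illustrative large model

task_prompts = [
    "Classify this support ticket: 'My order arrived damaged.'",
    "Classify this support ticket: 'How do I reset my password?'",
]

with open("distillation_train.jsonl", "w") as f:
    for prompt in task_prompts:
        resp = bedrock.converse(
            modelId=TEACHER_MODEL_ID,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
        )
        answer = resp["output"]["message"]["content"][0]["text"]
        # Each prompt/answer pair becomes one training example for the student model.
        f.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")
```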

Now, agents. Multi-agent collaboration is really about reducing the complexity required in building out these complicated applications. So for example, in investment banking, you are trying to determine how to make certain types of investments within a single market, let’s say energy. And what you want to be able to do as a user, as an analyst, is type a very simple query in natural language: “How is the price of solar panels going to be affected by the geopolitical climate in the following countries, given that there won’t be subsidies anymore?” And that ultimately will affect the stock price of the solar industry.

And so what a person would otherwise have to do is go to various SQL databases and read many different trade news sources. What multi-agent collaboration allows is that you have very specialized agents, each specific to a given function. So you can have agents that are very specific to carrying out certain SQL queries about prices of solar panels and retrieving the results, and then you can have agents that are very specific to retrieving news from trade magazines. And then you have a master agent who brings it all together into a single point of view for the analyst. So agents are very much an efficiency driver, number one, and number two, a democratization driver, because they allow someone who doesn’t necessarily know SQL or other complicated things to access data. They can just access things through natural language.
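
The pattern she describes can be sketched in a few lines: specialist agents handle narrow tasks and a supervisor (master) agent decomposes the question, delegates, and merges the results. The agent functions and data below are hypothetical stubs, not Bedrock’s multi-agent collaboration API.

```python
# Simplified, hypothetical sketch of multi-agent collaboration: specialist
# agents handle narrow tasks, and a supervisor ("master") agent decomposes the
# question, delegates, and merges the answers. The data sources are stubs; a
# real system would call SQL databases, news retrieval tools, and an LLM.

def price_agent(market: str) -> str:
    """Specialist agent: would run SQL queries against pricing databases."""
    return f"Average {market} panel price fell quarter over quarter (stubbed data)."

def news_agent(topic: str) -> str:
    """Specialist agent: would retrieve and summarize trade-press articles."""
    return f"Trade press reports subsidy cuts affecting {topic} (stubbed data)."

def supervisor(question: str) -> str:
    """Master agent: breaks down the question, delegates, and merges findings."""
    findings = [
        price_agent("solar"),
        news_agent("solar manufacturers"),
    ]
    # In practice an LLM would synthesize these findings into one point of view.
    return f"Question: {question}\n" + "\n".join(f"- {f}" for f in findings)

print(supervisor("How will subsidy cuts affect solar panel prices?"))
```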

Jason Andersen: So that’s a lot of stuff.

Sherry Marcus: Yes.

Jason Andersen: I guess if we go back to your golden triangle that you just described a few minutes ago, how do you determine how to prioritize all these different types of capabilities in a market where AI and LLMs are still very fluid?

Sherry Marcus: Yes.

Jason Andersen: So how do you kind of pick?

Sherry Marcus: So the first thing we really do is work backwards from our customers’ problems. And for many customers, their business requirements don’t require a big model, but they do want very high accuracy for their customers. Like call centers: you don’t necessarily need large models to produce those types of results. So in those cases we use Distillation; let’s give the customer a small model, but at the best possible accuracy we can for the task. Whereas if you’re in a situation that requires very high accuracy, let’s say a trading situation or government, then we would optimize for accuracy. Sometimes we would use the smaller model if it was truly comparable, and sometimes we would add additional features to increase accuracy.

Jason Andersen: So there are a lot of models out there. Somebody was just talking recently about there being 1,800 models available. In terms of the Distillation capabilities, how do you pick the models? How do you make sure that you’re delivering that consistent, or at least the best, customer experience you can when there are so many models to choose from? You know what I mean? How do you kind of narrow the field a bit?

Sherry Marcus: Yeah, so we tend to start with models that customers have strong affinity for and adoption of. So the Llama class models are in our Distillation release, the Claude models from Anthropic are in our Distillation release, and we will soon have our own new Nova models in Distillation as well.

Jason Andersen: When you look at agents, there’s some folks who think that in order for something to be called an agent, it has to be a hundred percent autonomous. Now that’s not necessarily the widely held view, but I was just curious in terms of what AWS thinks and what AWS’s point of view is on agents and particularly agent autonomy as we head down the road into this kind of agentic future.

Sherry Marcus: Yeah, that’s a great question. We don’t have a specific position on it. Agents, like everything, are a capability that we provide customers, and all an agent really is is this: you have an LLM and you ask it a question like, “What was the price of solar panels?” It’ll call an agent with that prompt to retrieve that information, and then the LLM will provide a nice summary. But really it’s an API that uses tools and may have long-term memory to remember customer transactions. Personally, I think we’re a ways off from truly autonomous agents for very complex systems. For simple systems, we certainly can move more towards autonomous agents.

Jason Andersen: In some ways, I think there are people out there trying to kind of replace the idea of robotic process automation with agents, and they’re distinct, right?

Sherry Marcus: They are.

Jason Andersen: So I think that’s kind of part of the learning journey everybody’s on is that these things can work together. They don’t have to be replacements for each other or anything like that. One of the other things that’s interesting, speaking of kind of autonomous types of actions if you will, there’s also a new capability called routing, a new model routing capability. Can you talk to us a little bit about that?

Sherry Marcus: Sure. So if a customer enters a prompt, we have a new routing capability which enables us to choose the smallest possible model that will answer that specific query. This type of automatic model routing, again as a cost driver, happens pre-inference: the prompt is analyzed and matched to the model it’s best suited for, and it’s really very useful, again, to drive down costs. And we found that perhaps 40% of all prompts are very simple prompts that can really be answered by the simpler models rather than the more complex models. And so as a result, it’s a very effective tool to drive down costs for customers.
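
To illustrate the idea of pre-inference routing, here is a toy sketch that estimates how demanding a prompt is and sends it to the smallest model expected to handle it. The complexity heuristic and model IDs are hypothetical stand-ins, not AWS’s actual prompt-routing implementation.

```python
# Illustrative sketch of pre-inference model routing: estimate prompt
# complexity and send the prompt to the smallest model expected to answer it
# well. The heuristic and model IDs are stand-ins, not AWS's actual router.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

SMALL_MODEL = "meta.llama3-1-8b-instruct-v1:0"    # cheaper, lower latency (illustrative)
LARGE_MODEL = "meta.llama3-1-70b-instruct-v1:0"   # higher accuracy, higher cost (illustrative)

def estimate_complexity(prompt: str) -> str:
    """Toy stand-in for a learned router: long or multi-part prompts count as complex."""
    if len(prompt.split()) > 60 or "compare" in prompt.lower():
        return "complex"
    return "simple"

def route_and_answer(prompt: str) -> str:
    model_id = SMALL_MODEL if estimate_complexity(prompt) == "simple" else LARGE_MODEL
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]

print(route_and_answer("What is a solar panel?"))  # a simple prompt routed to the small model
```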

Jason Andersen: So Sherry, one of the other features I saw with Bedrock that I thought was super interesting was this idea of model evaluation.

Sherry Marcus: Yes.

Jason Andersen: Can you explain that a little bit? Can you tell us what’s going on there?

Sherry Marcus: A hundred percent. So many customers, and we talked about model choice, want a concrete way to evaluate which model is best for their business use. Customers will often look at the public benchmarks, but unfortunately or fortunately, the public benchmarks may not be the most reliable source for customers to evaluate their specific requirements against. So as a result, we created new features in model evaluation. The first is just being able to evaluate models: customers are able to select some example data sets, and we will be able to synthetically generate similar types of examples from the customer’s data, so they don’t have to generate bazillions of different data sets against the many, many different metrics and features that are interesting for their specific use cases. And we’ll come out with a customized evaluation for them.

Now, we do this specifically for the model, but we also do this on Bedrock Knowledge Bases, so from an end-to-end perspective. So as customers begin to use RAG and use agents, they’re going to be able to evaluate, from an end-to-end perspective, how well their model performs in the application scenario that they’re looking at. And again, all of this is to drive model choice and enable the best possible model for the customer.
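
A toy evaluation loop makes the model-choice point concrete: run the same evaluation prompts against candidate models and score the answers with a task metric. The data set, keyword metric, and model IDs below are illustrative stand-ins, not Bedrock’s managed evaluation feature (which, as described above, can also generate synthetic examples).

```python
# Toy sketch of model evaluation: run the same prompts against candidate models
# and score the answers. The data set, keyword metric, and model IDs are
# illustrative stand-ins for a managed, use-case-specific evaluation.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

eval_set = [
    {"prompt": "Which AWS service offers multiple foundation models through one API?",
     "expected_keywords": ["Bedrock"]},
]

def score(model_id: str) -> float:
    """Fraction of evaluation prompts whose answers contain the expected keywords."""
    hits = 0
    for example in eval_set:
        resp = bedrock.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": example["prompt"]}]}],
        )
        answer = resp["output"]["message"]["content"][0]["text"]
        if all(k.lower() in answer.lower() for k in example["expected_keywords"]):
            hits += 1
    return hits / len(eval_set)

for candidate in ["meta.llama3-1-8b-instruct-v1:0", "anthropic.claude-3-haiku-20240307-v1:0"]:
    print(candidate, score(candidate))
```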

Jason Andersen: So you said it was the data, but do you also include sample prompts in that too? Or is it just you’re looking at the data sets and the integrations going in?

Sherry Marcus: So we look at the sample prompts as well, for different types of agents and different types of application scenarios, as well as for the models. We look at just the customer prompts if they were going to run the LLM alone, and then we look at an expanded set of prompts if they were going to run it as RAG, accessing different databases or other types of functions.

Jason Andersen: So one of my favorite parts of the job, when I was doing product development and product management, was actually internal use. I always felt like you learn a lot from customers, but really your own folks tend to beat on it a little bit more and, quite honestly, be a little more honest with the feedback. So I would just love to hear how you’re using this internally and doing different types of experimentation with these features within AWS or Amazon, if you don’t mind sharing.

Sherry Marcus: Yes. Much of the work that we’re developing is truly new, such as model routing. And so one of the questions that we had is, “How do we know what good looks like for a model router?” We initially decided on a metric: “Okay, we’ll flip a coin. If better than 50% of the time we get improved results with a smaller model, then it’s a good bet for the customer to use the smaller model.” But then we decided maybe that’s not the best metric. And so we went out and we looked at different kinds of metrics, and I won’t describe the specific metric, but metrics that would be better suited to this type of evaluation.
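
As a rough illustration of the coin-flip baseline she mentions: measure how often routing to the smaller model preserves answer quality, and compare that win rate to 50%. The scoring inputs below are placeholder numbers for illustration only; in practice they would come from a judge model or human raters, and this is not the metric her team ultimately adopted.

```python
# Toy illustration of the coin-flip baseline for judging a model router:
# if routing to the smaller model holds up on more than 50% of prompts, it
# beats random chance. Scores would come from a judge model or human raters;
# this function only computes the win rate.

def small_model_win_rate(scored_prompts, tolerance=0.02):
    """scored_prompts: list of (small_model_score, large_model_score) pairs in [0, 1]."""
    wins = sum(1 for small, large in scored_prompts if small >= large - tolerance)
    return wins / len(scored_prompts)

# Placeholder scores, illustrative only.
scored = [(0.92, 0.93), (0.70, 0.90), (0.88, 0.86)]
print(small_model_win_rate(scored))  # above 0.5 would beat the coin-flip baseline
```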

And another example is in the area of agents, agentic behavior. We will look at millions and millions of different prompts and different types of scenarios for agents to determine accuracy and whether we are building the right framework, role types, or models. And a lot of it is very much trial and error; you learn by experimentation. There is very little, I guess the best word is tradecraft: running a simple machine learning classifier is something that has thousands and thousands of person-hours invested in how to do it perfectly. That doesn’t exist yet for gen AI, and it’s very painstaking work. And I’m personally very proud of my team for their grit in getting through to superb solutions.

Jason Andersen: Great. Well thanks for joining us, Sherry.

Sherry Marcus: Thank you.

Jason Andersen: And thank you all for tuning in to this episode of Six Five On The Road at AWS re:Invent 2024. If you enjoyed what you saw, please check in on our socials at Six Five, or click like and subscribe, and have a great day. Thanks.