Unlocking Cloud Efficiency: AWS Reveals AI-Driven Operations

The cloud ops landscape is evolving at lightning speed ⚡ from simple web servers to serverless functions and now, the explosion of AI workloads. 🤯

Hosts Daniel Newman and Patrick Moorhead are joined by Amazon Web Services’ Nandini Ramani, Vice President of Search and Cloud Ops, on this episode of Six Five On The Road at AWS re:Invent, for a conversation on AWS’s latest innovations in cloud operations, spotlighting the integration of AI and machine learning to elevate efficiency and performance.

Their discussion covers:
– The evolution of cloud operations challenges over the years and AWS’s approach to addressing them
– Lessons learned from AWS’s 17+ years in operation and their impact on the development of cloud services
– Integrating AI and machine learning in cloud operations for enhanced efficiency and performance
– An introduction to the Opsitron concept and its significance in cloud operations
– Exploring the new “Explore Related” button and its role in simplifying troubleshooting across interconnected services

Learn more at Amazon Web Services.

Watch the video below, and be sure to subscribe to our YouTube channel, so you never miss an episode.

Transcript

Patrick Moorhead: The Six Five is On The Road here in Las Vegas. We are at AWS re:Invent 2024. Dan, the conversation has been unsurprisingly about AI, different ways to slice it, just different ways to operationalize it. Just getting ready for this gigantic enterprise swell of AI that I think both our firms have estimated is a little bit off.

Daniel Newman: Yeah. Well, I think a lot of people are starting to get the fatigue of hype. And what they’re starting to look for is pragmatism. They want to understand where does this technology really help drive the enterprise. We hear these astounding numbers, 20 trillion of economic opportunity, 25 trillion. I’ve heard some gigantic numbers. You’re hearing half a trillion dollars of spend just on the chips in the next handful of years. But in the end, a lot of this starts to be about how we experience things.

Patrick Moorhead: It is, and enterprises are trying to get their data estate in line, questions about security, questions about governance. And then doing AI at scale or any scale, you have to be able to operationalize it. IT ops we cannot forget because that basically keeps everything moving. And if you don’t put that into your strategy with AI, you’re going to not be able to do this. And I can’t imagine a better person to talk about this, Nandini, welcome to The Six Five.

Nandini Ramani: Thank you for having me. Looking forward to this chat.

Patrick Moorhead: Yeah, let’s talk about IT ops at scale.

Daniel Newman: Of course, it’s been a really exciting week. It’s a fire hose and everybody kind of knows that, that is the AWS, so much engineering prowess, so much pedigree in this particular space. But yeah, as we see all this data move to the cloud, we see all these workloads, you got to keep it up and running. And I know you gave a talk, you talked about … by the way, I think you had three challenges and I always love this. What keeps you up at night? CIOs, CTOs, CISOs-

Patrick Moorhead: Operators.

Daniel Newman: They are awake at night trying to figure out how do they make all these cool features that we like to talk about work. So give a little bit of that background of what’s keeping them awake and how’s that changing in this era?

Nandini Ramani: That’s a great question. So when you think about the complexity these days, you started off with staggering numbers just in terms of compute, the spend that’s coming with Gen AI, et cetera. So if you think about the trajectory of where we started and where we are, we had simple days where you had one box, maybe one for load balancing redundancy, you ran a web server on it and you were off to the races. Then we got to EC2 as compute, then we went to EKS, ECS. And now serverless, with ephemeral little workloads that spin up and disappear. You have no idea where things are running. But it lets you scale. It lets you do so much more than you could initially. So I always ask our customers, what keeps you up at night? And it’s a trifecta, if you will. Number one, they want their operations to scale as their business scales.

Patrick Moorhead: Makes sense.

Nandini Ramani: Without having to do anything, it just needs to work out of the box. Second, they want to be able to have insights into the data and all the telemetry that’s being emitted. Whether it’s on-prem on their own, whether it’s on EC2 instances or, in some cases, multi-cloud. No matter where it resides, you want to be able to gather insights without doing any heavy lifting on ETL.

Patrick Moorhead: Right. Understand.

Nandini Ramani: Maintaining pipelines, doing all of those complicated things. Third, everybody wants automation. Can you just automate it? Give me built-in controls, fully managed. Those are sort of the three things that we try to address, and that’s what most of our services do for them. The undifferentiated heavy lifting. So our customers can just focus on their business and their end customers.

Patrick Moorhead: No, it makes sense. I mean, listen, AWS was all about simplicity. Focused initially, actually still focused on developers and builders and letting a lot of the driving to somebody else that they just didn’t want to do because it didn’t add business value. So that makes total sense. So I want to drill down a little bit into the history and what you’ve learned over the past 17 years that have given input into the products that you’ve chosen and the services that you’ve chosen to deliver at scale operations.

Nandini Ramani: It’s exactly that, the 17 years of experience. And remember, we were the first cloud. So we’ve had 17 years to build things for ourselves. That’s how it started, AWS was born because Amazon was scaling. All the challenges that I outlined came from 17 years of learning from building it for ourselves. In fact, Systems Manager, which is one of our services, was built so we could maintain our own instances and keep Amazon retail running. That’s how it started out. And now we externalize it. And that’s typically what we do. In fact, if you take CloudWatch, our flagship observability service, we use it heavily internally.

In fact, you can find developers across both Amazon and AWS poring over dashboards, troubleshooting to make sure that we are ready and always available for our end customers who rely on us. So it is that 17 years of experience that has helped us get here. And in fact, we ourselves use our tooling in a similar fashion internally. And the second thing I would say is, I have never found a company where we listen to our customers so closely. 90% of our roadmap is driven from customer requests. That’s what we do, we have these operating plans that we build. They’re entirely based off of customer requests. I think it’s those two things, our own experience and what customers want.

Daniel Newman: So you heard us in the preamble talking about AI and the acceleration, and you sort of alluded to it, because you were giving a little bit of the history of going from web server to container to serverless. And AI’s kind of doing the same thing. We’ve had this era of sort of data and data management, and then we had this machine learning era, and now we have the AI era. And cloud operations has to follow this. How are you integrating cloud operations into … I don’t know, the last two days of announcements, which are almost all built on a combination of managed AI services and self-built AI services that enterprises are really just beginning to adopt?

Nandini Ramani: Yeah. And I think Swami said it in his keynote today, and it’s true. We’ve always had it, if you think about anomaly detection, we’ve always been on the journey of AI and ML. And now Gen AI, with the ability to reason. So first of all, Amazon Q Developer, our flagship product, think of it as the one and only service that fully understands AWS. That is powerful. You don’t have to go to runbooks, you don’t need to call support. You can ask it any question about AWS and it gives you an answer. So that, in and of itself, is already powerful. But on the journey of Gen AI, last year we released natural language querying, also powered by Gen AI behind the covers. Because every tool has its own query language, SQL this, PPL that, and so on and so forth. So we’ve had that integrated in Config, CloudTrail, CloudWatch, OpenSearch, all of them now support natural language. And we’ve received tremendous feedback, it saves a lot of time for developers instead of typing queries.
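The query-language friction she describes is real: CloudWatch Logs Insights alone has its own pipe-based syntax. As a toy illustration of what the natural-language layer saves developers from typing (the lookup table below is invented for this sketch; the actual feature uses Gen AI inside the AWS consoles), a plain-English request maps to a Logs Insights query along these lines:

```python
# Illustrative sketch only: a canned lookup standing in for the Gen AI step
# that turns a natural-language request into a CloudWatch Logs Insights query.
# The request strings and the mapping are hypothetical.
def to_logs_insights(request: str) -> str:
    """Return the Logs Insights query a given plain-English request maps to."""
    templates = {
        "20 slowest requests in the last hour": (
            "fields @timestamp, @duration, @message "
            "| sort @duration desc "
            "| limit 20"
        ),
        "count errors by service": (
            "fields service "
            "| filter level = 'ERROR' "
            "| stats count(*) by service"
        ),
    }
    return templates[request]

print(to_logs_insights("20 slowest requests in the last hour"))
```

Actually running such a query would go through the Logs Insights API (for example boto3’s `start_query`/`get_query_results`), which is exactly the plumbing the natural-language interface hides from the user.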

This year, Matt Garman announced in his keynote the ability to do operational investigations with Q Developer. So what it does, you ask Q Developer a question, it brings you to the CloudWatch console and you can start troubleshooting. We have 17 years of experience on our own services. So we built a knowledge graph based on those learnings, so many years of customer behavior patterns and how they use our services. And so once you turn the investigation on, it’ll traverse that and tell you what the likely cause is. And the typical causes of things going wrong are deployments and configuration changes. Or in some cases, load balancers, auto-scaling, those sorts of problems. So it can pinpoint, it builds the topology for you, pinpoints where the problem is. And if you accept it, you can even remediate it in place.

Now Matt also alluded to the fact that currently there are hallucinations, and it’s not a hundred percent yet. But we’re working hard with automated reasoning to make sure that it is solid and it can correct itself as it builds up confidence. So that capability is available today for everyone to use in preview. And in fact, like I said earlier, we always eat … I like to say we sip our own champagne. But basically we use it internally. And in fact, the Amazon Kindle support team has used Q investigations and they have saved 65 to 80% in troubleshooting time. That is phenomenal. Think about the possibilities when people start using this at scale. And I truly believe, just like today we don’t talk about anomaly detection as a thing, Gen AI will just be part of everything we do.

Patrick Moorhead: Yeah, so a couple of themes I’m picking up so far here. First off, customer zero. Amazon, but also when it comes to cloud operations, you’re customer zero for all this … Q, by the way, when Matt got up on stage and showed all the operational stuff you could do with Q, I thought that was pretty cool and pretty amazing. And there’s also the at-scale part. But I do have to ask you, in your talk, this phrase came up that I thought was pretty cool. Opsitron. I hope I’m saying it correctly.

Nandini Ramani: You are saying it right.

Patrick Moorhead: What is Opsitron?

Nandini Ramani: So I mean, it’s a pun on the fact that we run cloud operations. So Ops and cloud Ops, and we came up with … I didn’t, to be honest.

Patrick Moorhead: What’s the tron? Is it-

Nandini Ramani: It’s a made up thing for ourselves.

Patrick Moorhead: It’s like a verb. It’s a verb.

Nandini Ramani: It’s a verb. But now it’s going to be for us, anyway. But the idea is-

Patrick Moorhead: It’s a movie.

Nandini Ramani: We build individual services. Like I talked about Systems Manager, which can do node management, CloudWatch does observability. Individually these things are very powerful, whether it’s metrics or logs. So what we came up with is, each of our individual services is powerful in itself. I think it’s a quote from the fourth century BC, from Aristotle: the whole is greater than the sum of its parts. But I was like, I don’t want to use that analogy. So they came up with this fun new contemporary way of saying it, individual bots. The whole theme for the talk was, we have a metrics bot and a logs bot and so on and so forth. But when they come together, they become even more powerful and help you troubleshoot much faster.

Patrick Moorhead: Yeah, it’s like the Wonder Twins unite. I’ve been here, I watch the cartoons.

Nandini Ramani: But it’s a theme for us because just like I said, we do the undifferentiated heavy lifting. This is another thing we want to do. We don’t want you … the customer shouldn’t have to stitch all this information together. We want to do it for them. So we thought it’d be fun and it seems to resonate, so we got the theme of Opsitron.

Daniel Newman: Tell us a little bit more about that though, the Explore Related button. The demo looks like it’s basically stitching services together and making the observability or observable nature of all-

Nandini Ramani: Much easier. So Gen AI is still early. And so we’ve also built a contextual graph within CloudWatch, that’s the one you’re alluding to, so you don’t have to type anything. Just point and click and it guides you through the topology. It points to where the issue is and it takes you all the way from metrics to logs. Which is usually the hard part of troubleshooting. It’s like looking for a needle in a haystack. Multiple times it’s looking for a particular needle in a particular haystack.

Or how many needles in a haystack. I can expand on those analogies, but that is the part that’s so hard for folks to do. So picture, you have the contextual graph guiding you through the telemetry and you have this investigation assistant. If the answer that you derive as a human, which is what we do today, aligns with what you’re seeing with Gen AI, it improves the confidence and it improves the learning capability of the service, the operational assistant. I think that combination is going to be amazing.

Patrick Moorhead: Yeah. So another scenario was troubleshooting, and I think it was this, now that we’re talking about mashups between CloudWatch and APM. Can you talk us through a little bit of that? The need, the value, the benefit.

Nandini Ramani: Yeah. So we live in infrastructure land all the time, and that’s our world.

Patrick Moorhead: Infrastructure’s cool.

Nandini Ramani: I think it’s very cool, but it’s not for everyone. But our customers want to focus on their business, their application. They need all the infrastructure, they need the nodes, they need the logs, they need the telemetry. But what they really care about is, is there any latency for my end customer? Are there packet losses? Did my latest deployment cause an issue? So they need to start at the application, and this is what we do internally. We take everything as a span from every web service, and we convert that into logs and that’s how we troubleshoot. So that is the feature we’ve launched now with Application Signals, bringing it together with the service that many of our customers use called X-Ray. And the latest one where you can actually go from those spans, that was the demo in the innovation talk that David showed.

Daniel Newman: Yeah. So we’ve covered a lot of ground and kind of the history of AWS and re:Invent is all about this fire hose, this funnel of announcements. So you touched on a few. Let’s do the recap. Let’s kind of end this thing a little bit on the recap. Biggest announcements in your business, what are you most excited about? What do you want all the viewers out there to take away from this conversation? As your sort of big moments from this year’s re:Invent.

Nandini Ramani: Yep. So the biggest things for me, some of the launches I already talked about, the investigations assistant. Please kick the tires on it. The other thing I would say is Fault Injection Service. Resiliency is so important. Think about Prime Day, I can’t think of anything that needs more high availability than Prime Day. So we from AWS helped retail run over 700 experiments on Fault Injection Service. So I would encourage viewers to give that a try, because resiliency is as critical as observability.
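Fault Injection Service experiments are defined by experiment templates. As a rough sketch of what one looks like (the tag values, role ARN, and account ID below are placeholders, and field details may differ from the current API), a template that stops one tagged EC2 instance and restarts it after five minutes might look like:

```json
{
  "description": "Sketch: stop one tagged EC2 instance, restart after 5 minutes",
  "targets": {
    "myInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Environment": "staging" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stopInstance": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT5M" },
      "targets": { "Instances": "myInstances" }
    }
  },
  "stopConditions": [{ "source": "none" }],
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role"
}
```

In practice the stop condition would typically reference a CloudWatch alarm so the experiment aborts if the blast radius grows, which is exactly the resiliency-meets-observability point being made here.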

For example, we’ve launched Database Insights in CloudWatch, we’ve launched … oh, here’s a big theme that I’m super excited about. Zero ETL, remember I said customers don’t want to move their data around. So we have Zero ETL between CloudWatch and OpenSearch Service. And we’ve also extended that to Security Lake and OpenSearch Service. So you can run analytics. OpenSearch has very rich analytics and you can run that no matter where your logs reside. Whether it’s in security use cases or in CloudWatch, it just works seamlessly. Containers, many of our customers run their applications on containers. So we launched Enhanced Container Insights for EKS last year, and this year we launched it for ECS as well. We have two new preventative policies to help you prevent drift once you set your configuration. And we’ve added enhanced node management capability in Systems Manager. As you can tell, I love this stuff. So we have so many launches.

Daniel Newman: Appreciate all of the children, as we like to say-

Patrick Moorhead: She does –

Daniel Newman: No favorites, right? You don’t want to upset anybody.

Nandini Ramani: I have no favorites, I love them all.

Daniel Newman: Any of your product leaders. You want them to know you love them all.

Nandini Ramani: I love them all.

Daniel Newman: You love them all, and all those customers clearly, including Amazon as customer zero. We appreciate so much you spending the time here with us at re:Invent. I’m sure it’s very busy, and your feet probably hurt a little bit like mine do from all the steps you’re getting in.

Nandini Ramani: Love the steps, though. But wear comfy shoes.

Daniel Newman: Yeah, you wear comfy, but they’re-

Nandini Ramani: Lesson one.

Daniel Newman: Still stylish. Y’all can’t see them, but I promise you they are. And for everyone out there, I want to thank you so much for joining us here. The Six Five is On The Road at AWS re:Invent 2024 in Las Vegas. Covered a ton of ground. Subscribe, join us for all of our other content and coverage here from Pat and I from the whole Six Five team, it’s been a busy week. But we got to go for now. So we’ll see you all later.
