How to Build a Leading Observability Practice

As engineering and ITOps teams have moved to the cloud and ramped up their pace of innovation, it has become harder to validate the impact of software and infrastructure changes, both on the business and on the customer experience. Disconnected toolchains and an explosion of failure scenarios in cloud native environments have made this problem worse. As a result, it still takes too much toil, guesswork, and expensive war room calls to find answers to problems or even to know where to look.

Join observability leaders at Splunk for a lively discussion on what it takes to build a leading observability practice. From boots on the ground experience working with our customers, we’ll share how high-performing engineering and ITOps teams are using observability to improve their digital resilience, and how Splunk can help you achieve:

- Unified visibility across any environment and any stack
- Earlier detection and investigation of business-impacting issues
- Better control of your data and costs

Transcript

Paul Nashawaty:
Hello, and thank you for joining us at the Six Five Summit: AI Unleashed. Welcome to the session on How to Build a Leading-Edge Observability Practice. My name is Paul Nashawaty. I'm the Practice Lead for the Application Development and Modernization practice at The Futurum Group. I'm thrilled to be here with Mala and Patrick from Splunk. Mala, would you like to introduce yourself?

Mala Pillutla:
Sure. Hi, Paul. Hi, everybody. My name is Mala Pillutla and I’m the GVP for Observability here at Splunk.

Paul Nashawaty:
Thank you, Mala. And Patrick?

Patrick Lin:
Hey there. Patrick Lin, I'm the SVP and General Manager for Observability at Splunk, a Cisco company.

Paul Nashawaty:
I am really, really excited to have you both on today's session to talk about the observability movement and how it relates to AI. Today, we'll be talking with both of you about how engineering and ITOps teams have moved to the cloud to ramp up their pace of innovation, and how it has become harder to validate the impact of software and infrastructure changes. Let's get started.

Mala, let's start with you. What we saw in 2023 was largely driven by the macro context, with a lot of customers trying to rationalize their tool stacks. In fact, in our own observability research, 75% of respondents indicate that they're using six to 15 observability tools to gather organizational data. Do we expect this to continue in the near term, or what would a long-term strategy look like?

Mala Pillutla:
That's a great question, Paul. Actually, in the observability research that we've done, the number of tools is much greater: organizations today have more than 20 monitoring tools on average. My view is this: I do believe this trend is going to continue. As our customers mature their observability practices, most often they're also transforming their business models, and they're looking for more efficient ways to have a comprehensive view of the full stack across different telemetry types while, at the same time, optimizing cost through rationalization.

Some of the things I hear from our customers consistently are that they share similar challenges and opportunities as it relates to tool consolidation, and most often I sum it up in three things. One, what we hear often is that the current macroeconomic climate is increasingly about improving operational efficiency within organizations. What does that mean? Driving more effectiveness with fewer solutions. Second, what I hear is that it's about the ability to be effective and have a comprehensive view across the full stack of infrastructure and applications, as opposed to having different tools and different versions of reality.

When they do have those different versions of reality, it impacts the cost of downtime and service availability. The third most common feedback we hear is that it's about driving cost optimization by rationalizing tools and technologies to eliminate that redundancy. Here's what I mean. On average, like I said earlier, an organization can have more than 20 monitoring tools. What does this mean? It makes it harder to diagnose when there is a system or service degradation. How do I find the root cause? Which system, which application, which network or database is causing that degradation?

In some cases, customers are missing critical signals, like a failure, an alert, or an outage, and it goes unnoticed. I do see this trend toward rationalization continuing.

Paul Nashawaty:
It seems like the TCO for that approach is pretty high when you have those multiple tools. Then, like you said, the integration has to work seamlessly in order to get that availability and visibility into how the systems are working. Patrick, I want to talk a little bit about complexity and the number of tools. Over the last couple of decades, we've seen a broad trend towards centralizing compute infrastructure in the form of cloud computing.

In fact, we see in our research that 94% of respondents are using two or more distinct cloud infrastructure and service providers. More recently, it's becoming clear that certain applications and use cases will require the infrastructure to be closer to where the users are and where the data is. Essentially, we're talking about computing at the edge. When we look at that edge and we look at clouds, what do you think, Patrick, of this trend, and what does it mean for observability at the edge?

Patrick Lin:
Yeah, that’s a great question, Paul. I think when we talk about things at the edge, there’s a variety of situations where we see that. Sometimes it’s something as simple as, hey, it’s retail or a quick service restaurant or something like that where you fundamentally need some sort of in-store computing where it doesn’t make sense to put all of that into the public cloud. I think there’s also been a more recent trend of seeing people repatriate some of the workloads that they had previously moved into the cloud saying, “Well, actually the predictability of those means that they can now perhaps be in a different environment.”

I think that in some cases there were some things that never made their way into the cloud. This very distributed environment that you end up having probably results, in a lot of cases, in critical business transactions or services being delivered across a very hybridized landscape. It's not uncommon, I think, for us to see things like a new application that's been built where the front end is something that's in the public cloud, but it ultimately ties back to a system or set of services that's still running on-prem, because it never made sense to move it or because they moved it back from the public cloud.

In order for that to be observed and monitored properly, it's pretty important to have a few different things. One of them is to have a pretty consistent way of getting the data in so that it is consistent with itself. You don't want to have silos of data based on where a workload or part of a service is being served from. A second piece here is a set of capabilities that looks at that data and is able to show you what's going on across it in a very consistent fashion. You don't want to have fragmented views, again based on where the data's coming from.

I think the third piece is that because of that sort of distribution, you also need greater visibility into what’s happening across the network and in many cases across the internet, depending on how the application itself is structured. It’s more important than ever to have that be included as part of the visibility that you have. Then I think the last piece is that you need to have the ability to have the visibility both in that on-prem or customer-managed environment as well as something that takes advantage of the public cloud.

I think overall, it's actually pretty consistent with a lot of what Mala was saying around a consolidated view across things and having full stack observability across that. It is, by the way, one of the reasons why I think the acquisition of Splunk into Cisco makes so much sense, because it's a way for us to provide the connections across all those different sources of information, to use OpenTelemetry as the common format for that information to come in, and to provide the right tooling so that you can get to the root cause of issues as quickly as possible.
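
As a minimal sketch of what using OpenTelemetry as a common ingest format can look like in practice, assuming the OpenTelemetry Python SDK; the service name, collector endpoint, and attributes below are hypothetical placeholders for illustration, not details of Splunk's or any customer's setup:

```python
# Hypothetical sketch: every service, whether in-store, on-prem, or in a public
# cloud, emits telemetry in the same OTLP format to one collector endpoint,
# so the data stays consistent instead of landing in per-environment silos.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Tag telemetry with where it runs so views can be filtered, not fragmented.
resource = Resource.create({
    "service.name": "checkout",            # hypothetical service name
    "deployment.environment": "on-prem",   # could equally be "aws" or "store-edge"
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "example-123")
```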

One last thing I'll add, by the way, is that I think sometimes there is a question about what data you want to bring in when the infrastructure is not already centralized. The other piece that's useful there is the ability to watch over the data as it's making its way through the pipeline, from, let's say, wherever that infrastructure or application is located to where you ultimately are going to be doing your troubleshooting and monitoring and so on.

And so, having the ability to look at the data as it goes through the pipeline, and deciding whether you want to keep it, drop it, aggregate it, or transform it, is another key thing that's important for people to understand how to use in the context of this trend toward some additional decentralization of infrastructure.
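
To make the keep, drop, aggregate, or transform decision concrete, here is a small illustrative Python sketch of per-event pipeline logic; it is not any particular product's pipeline, and the field names and rules are assumptions made only for the example:

```python
# Illustrative only: the shape of per-event decisions made in a telemetry
# pipeline before data reaches the place where you troubleshoot and monitor.
from collections import Counter
from typing import Optional

HEALTH_CHECK_PATHS = {"/healthz", "/ready"}   # hypothetical noise to drop
debug_counts = Counter()                      # aggregate instead of shipping raw events

def process(event: dict) -> Optional[dict]:
    """Return a (possibly transformed) event to keep, or None to drop it."""
    # Drop: health-check chatter rarely helps with troubleshooting.
    if event.get("http.path") in HEALTH_CHECK_PATHS:
        return None

    # Aggregate: count debug logs per service rather than keeping each one.
    if event.get("level") == "DEBUG":
        debug_counts[event.get("service", "unknown")] += 1
        return None

    # Transform: strip fields you don't want leaving the source environment.
    event.pop("user.email", None)
    return event

events = [
    {"service": "checkout", "level": "DEBUG", "msg": "cache miss"},
    {"service": "checkout", "level": "ERROR", "msg": "payment timeout",
     "user.email": "someone@example.com"},
    {"service": "frontend", "http.path": "/healthz"},
]
kept = [e for e in (process(ev) for ev in events) if e is not None]
print(kept)          # only the transformed ERROR event survives
print(debug_counts)  # Counter({'checkout': 1})
```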

Paul Nashawaty:
Yeah, there's a lot there to unpack for sure. I mean, when you look at everything you mentioned, I can back up most of what you just described with our research and our data. When we look at application portability, 20% of respondents indicated that it's critical that their applications are portable. One of the things you touched on was the repatriation point, and we do see that in our research when we talk about modernization across past, present, and future: heritage applications moving to cloud native and so on.

When refactoring occurs, according to our research, only 11% of it is currently being done on-prem. When we look two years out, that on-prem refactoring actually goes up to over 30%. There's a repatriation coming back from the cloud. The point you were making, Patrick, about harmonization of the platforms and the tech stack, to make sure you have visibility across all areas, is equally important.

Now, I do want to talk about that as we talk about modernization of applications. Patrick, I do want to throw another question your way. There's a lot of interest around how security, development, and operations teams can benefit from signals from each other's domains and practices. What are you seeing or hearing from customers around how they approach this?

Patrick Lin:
Yeah, that's another really interesting topic. I guess maybe the place I'd start is by mentioning that when we think about security, development, and operations teams, oftentimes we think of them as having completely different objectives and being goaled on different outcomes. Back in the day, maybe they weren't so separate. I think the specialization that we see in larger organizations comes from the growth over time of these individual disciplines, where back in the beginning, they may have been using the same set of data.
I think there's a lot of information that is captured in one context that tends to be quite useful in the other. One example might be that you often find information about assets and identities in, let's say, a CMDB that's managed by an IT department, or that is being brought into an observability tool because it's important for the development team to be able to track all the containers or functions or other things that are being used.

On the flip side, the security teams are often lacking very good visibility into assets and identities. So being able to make that information available to the security teams, and then augment it with the more real-time information that typically comes in from observability, is valuable. That's one example of how there's interest in bringing that data across. I think the converse is also true: oftentimes security is part of the context that the engineering teams need to operate in.

If there's an issue and there's a spike in traffic of some kind, then one of the natural questions is, am I under attack? Knowing whether something is being investigated from the security side might be useful in those cases. In a less urgent scenario, dev teams often spend a good chunk of their time making sure that their applications are secure and meet various compliance needs and regulations.

To the extent there are feeds of information that can be brought in from the security side to help contextualize that, they can tell a team, "You need to prioritize your work this way: it's more important to do that fix versus this one, because your configuration here is actually the one that's going to be problematic, versus something out there that indicates this area is problematic but that you haven't set up in a way that actually is."

Having that very specific data that helps prioritize the work and ultimately lets the dev teams get back to doing what they're supposed to be doing, that's great information to have and to share. It ultimately makes the collaboration across the teams better as well, because they'll have a shared sense of reality and a shared sense of priorities.

Paul Nashawaty:
Yeah, it sounds like a lot of what you were talking about is the whole shift-left nomenclature and moving security back into the teams. When we look at that shifting of responsibilities and how things are happening, you see organizations that have DevOps, SREs, and platform engineering, but DevSecOps also plays into it.

Mala, when we talk about maturity, because it depends on maturity and how these organizations organize their teams, observability in general is, to some extent, still an immature practice in a lot of organizations. Then there are some organizations that really know what they're doing and sit at the far end of the curve. When we look at the rapid maturation from alerting and monitoring to really actionable insights, and how it's evolving, what guidance would you have for CIOs and CTOs as they think about their three-to-five-year roadmap plans for their organizations?

Mala Pillutla:
Observability is an evolving space and domain, and it's an exciting space to be in as well. Just like Splunk, a Cisco company, our customers are also evolving their observability practices, and it's very much a growing and thriving domain. What we've seen is rapid evolution in the past few years, both in the complexity of architectures and business boundaries, and in the need to ensure that critical business applications, be they on-prem, hybrid, or cloud native, are reliable, resilient, and scalable.

I think one thing COVID showed us was the need to rapidly change business models to meet your customers where they are. Our customers are increasingly operating in this complex application landscape, like I said, be it cloud applications, hybrid, or on-prem architectures, not to mention the recent rise of AI applications. There is a need for CIOs and CTOs to drive standardized observability tooling adoption across teams, due to the speed at which some of these businesses operate.

Fragmented ownership, I feel, only hampers the holistic viability of that solution. Working with our customers, we often hear that when observability adoption is supported by product solutions across that varied application architecture and landscape, that's when value is realized. Most of the CIOs and CTOs we talk to typically focus on three value realization areas. One, how do I improve developer productivity so that engineers can spend more time actually building and shipping code instead of managing and troubleshooting their toolchain?
The second, equally important, is the ability to wrangle costs by maintaining governance and avoiding runaway usage costs. Last but not least, consistent practices for building digital systems that are observable from the get-go. For CIOs and CTOs, my recommendation is that it's not just about the products or technologies. They need to develop strategic relationships with technology players in this space that can effectively partner with them on the organizational roadmap.

It is an evolving domain, like I said, and we are learning from our customers and vice versa. We expect a close vendor relationship with our customers.

Paul Nashawaty:
That makes a lot of sense. I think the biggest challenge I heard in what you were describing is that when you talk to these CTOs and CIOs who are thinking about their roadmaps, they also have skill gap issues. You gave the answer, though: working with service delivery partners to help get them where they need to go will help augment their resources. Patrick, it wouldn't be a session if we didn't talk about AI. We have to talk about AI.

Mala talked a little bit about it, but when organizations are approaching AI, what does it mean for their observability practices?

Patrick Lin:
Paul, I’m impressed we made it 18 minutes without saying AI because it’s usually much faster for us to get to that topic. All kidding aside, I think maybe to talk about AI, it’s worth stepping back and defining what we mean by it. I think these days the hotness is all about OpenAI, ChatGPT, large language models and the incorporation of that into applications, the use of it for assistance and so on.

I think that AI, more broadly defined, also includes a lot of the work that's been done over the past couple of decades around machine learning and so on. Where I would actually start, when we're talking about an AI-forward organization in the context of observability, is making sure that you have the data necessary to identify when there are issues, and applying machine learning to that data so you know when something is behaving in a way that you don't expect.

I think the machine learning part of that is just being able to answer the "what's wrong here?" question in a more sophisticated way, versus saying, "Oh, when it's above 90, that's bad. Below 90 is good." That's way too simplistic for the kind of real-world scenarios that all of our customers have. That's one piece. I think the other piece is about being able to take advantage of generative AI and what it's good at for helping to narrow down the source of issues.
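
As a rough sketch of the difference between a fixed threshold and a baseline learned from the data itself, here is a simple Python comparison; the numbers, history window, and three-sigma rule are illustrative assumptions, not a description of how any Splunk product detects anomalies:

```python
# Contrast a static rule ("above 90 is bad") with a simple statistical baseline
# that adapts to what "normal" looks like for a given metric.
from statistics import mean, stdev

def fixed_threshold_alert(value: float, limit: float = 90.0) -> bool:
    """The simplistic rule: anything above the limit is bad."""
    return value > limit

def adaptive_alert(history: list[float], value: float, k: float = 3.0) -> bool:
    """Alert when a value sits more than k standard deviations from recent behavior."""
    if len(history) < 10:                  # not enough history to judge "normal"
        return False
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) > k * max(sigma, 1e-9)

# A metric that normally hovers around 40 suddenly jumps to 75.
history = [38, 41, 40, 39, 42, 40, 41, 39, 40, 41]
print(fixed_threshold_alert(75))    # False -- the static rule misses it
print(adaptive_alert(history, 75))  # True  -- the deviation from normal is flagged
```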

It's actually pretty interesting. We've been doing some experimentation with this internally. It used to be that when we first launched our products, we would give a demo where we would walk you through from one screen to another to find the root cause of an issue. I just saw a demo the other day where essentially the demo I used to give, which would take, let's say, 10 minutes of me pointing and clicking, could now happen in less than a minute by simply asking the question of the assistant that we've been working on.

The fact that you could save those nine minutes almost pays for the service and the software already. I think something folks should think about as they go forward is how they make sure they have the information in the system that actually allows the AI to provide better and better answers to them, so they can run more efficiently and have basically everyone be an expert, rather than relying on, "Hey, this person really knows this tool, this person really knows that one. If they're not here and there's an incident, what do we do?"

One last thing I’ll just add is that most organizations we talk to are in some form of experimentation around actually using large language models as part of their applications as well. I think it kind of goes without saying that like every other part of the application stack, this is something that does need to have some level of observability and security practices built around it as well. Something for everyone to keep in mind as they do the fun experimentation.

Paul Nashawaty:
Absolutely. Absolutely. There is a lot for the audience to consider in this conversation, and I know we only touched on the topic briefly here. There's a lot more to consider, a lot to think about. As we come to the end of our session, I want to thank you both for your perspectives and insights on observability and for providing guidance to the audience for their own efforts. I also want to thank the audience for attending our session today. Thank you and have a great day.
