How a Data Diet can Help to Achieve Essential Sustainability Targets

Cohesity has analyzed the issues posed by the exponential growth of data worldwide, which results from cheaper storage, cloud growth, and AI. Data center energy and efficiency are not keeping pace.

● In Europe, electricity is still generated mainly using fossil fuels; the energy required for the data economy produces large quantities of the most significant GHG, CO2
● Modern data centers consumed 0.3 kWh per billed GByte in 2018. In 2020, 29.8 TWh were used
● In OECD Europe, up to 217.98 million cubic meters of water were used in 2020. For 2030, researchers for OECD Europe expect:
○ With 112.7 TWh of energy, 547 million cubic meters of water are needed
○ Water consumption in Europe will rise to roughly 250 percent of its 2020 level within ten years
● Modern hardware is optimized for power efficiency, but new use cases are pushing these platforms. AI, for example, is data- and power-hungry: instead of 8.4 kW per rack, an AI rack needs up to 30 kW

The amount of data is already growing by an average of 50 percent per year in more than half of all companies, and the majority of organizations have infrastructure crammed with data, 70 percent of whose content is unknown on average. This uncontrolled growth has led to new terms being defined. The General Conference on Weights and Measures has expanded the unit prefixes for the first time since 1991: ronna and quetta have now replaced yotta as the largest unit prefixes. 1,000 yottabytes (the largest unit until now) make one ronnabyte, and 1,000 ronnabytes make one quettabyte, which is 10^30 bytes, or a nonillion bytes.
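The prefix relationships can be checked with a few lines of arithmetic. The sketch below assumes the decimal SI definitions (yotta = 10^24, ronna = 10^27, quetta = 10^30); the table and the `convert` helper are illustrative names, not part of any library.

```python
# Decimal SI prefixes as powers of ten (ronna and quetta were adopted in 2022).
SI_PREFIX_EXPONENT = {
    "zetta": 21,
    "yotta": 24,
    "ronna": 27,
    "quetta": 30,
}


def convert(value, from_prefix, to_prefix):
    """Convert a quantity between two SI prefixes (hypothetical helper)."""
    shift = SI_PREFIX_EXPONENT[from_prefix] - SI_PREFIX_EXPONENT[to_prefix]
    if shift >= 0:
        return value * 10 ** shift      # larger unit -> smaller unit
    return value / 10 ** (-shift)       # smaller unit -> larger unit


yottabytes_per_ronnabyte = convert(1, "ronna", "yotta")
ronnabytes_per_quettabyte = convert(1, "quetta", "ronna")
bytes_per_quettabyte = 10 ** SI_PREFIX_EXPONENT["quetta"]
```

Each step up the ladder is a factor of 1,000, so a quettabyte is a million yottabytes.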

“We are now living in the quettabyte era”, said Mark Molyneux, EMEA CTO at Cohesity. “However, there are actions that organizations can take today to reduce their data volume and undergo a ‘data diet,’ such as indexing and classifying data according to its content and value for the company. Everything that is without value can be deleted: obsolete data, duplicates of systems, orphans, outdated test systems. You can also reduce data volumes using technology such as deduplication and compression to eliminate redundant copies and automatically replace original data with a thin version; reduction rates of up to 97% are possible. Classification according to your relevant records policy enables data owners to make the right decisions. This will allow the defensible deletion decisions that you need to make but have been unable to make due to a lack of data intelligence. You will keep only what you need for the prescribed period and then automatically delete it. This will reduce your mountains of data, and it will also give you vital intelligence when you experience a cyber event and need to know what has been compromised, encrypted, or taken. AI and machine learning can truly help defuse complex problems, and their LLMs are empowered by solid data.”
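To make the idea of deduplication and compression concrete, here is a minimal sketch that keeps one compressed copy per distinct piece of content, keyed by its SHA-256 hash. It is an illustration under simple assumptions (whole-object deduplication, in-memory data, hypothetical function names), not a description of how any particular product works, and the reduction it reports depends entirely on the sample data.

```python
import hashlib
import zlib


def deduplicate(blobs: list[bytes]) -> dict[str, bytes]:
    """Keep one zlib-compressed copy per distinct content hash."""
    store: dict[str, bytes] = {}
    for blob in blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in store:          # identical content is stored only once
            store[digest] = zlib.compress(blob)
    return store


def reduction_rate(blobs: list[bytes], store: dict[str, bytes]) -> float:
    """Share of the raw volume that no longer needs to be stored."""
    raw = sum(len(blob) for blob in blobs)
    kept = sum(len(chunk) for chunk in store.values())
    return 1 - kept / raw


# Made-up sample: nine identical copies of one document plus one unique memo.
report = b"quarterly report " * 200
memo = b"unique memo " * 50
blobs = [report] * 9 + [memo]
store = deduplicate(blobs)
rate = reduction_rate(blobs, store)
```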

This new infrastructure’s exponentially growing energy hunger runs counter to the political goals and objectives of numerous global and European initiatives such as COP26 and the “European Green Deal” of 2020. The aim of the Green Deal is to make Europe climate-neutral by 2050. This initiative is being driven forward with the “European Digital Strategy” to ensure that data centers are climate-neutral by 2030. The International Energy Agency says emissions from data centers worldwide must be halved by 2030. This was before the sudden expansion of AI started to push computing and data volume.

While AI is data- and power-hungry, it can, alongside machine learning, help defuse one of the most complex problems: “unknown” data. Predefined filters immediately fish compliance-relevant data such as credit card numbers or other personal details out of the data pool and mark them. Once let loose on the data, the AI develops a company-related language, a company dialect. The longer it works and the more company data it examines, the more accurate its results become. Companies will be able to automatically identify obsolete, orphaned, and redundant data that could be deleted immediately.
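A predefined filter for credit card numbers could look roughly like the sketch below: a regular expression proposes candidates of 13 to 19 digits, and the Luhn checksum discards most random digit runs. The pattern, function names, and sample text are assumptions made for illustration; a production classifier would use far more signals than this.

```python
import re

# Candidate pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:            # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return the digit strings in text that look like valid card numbers."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


# 4111 1111 1111 1111 is a widely used test number; the second one fails the checksum.
sample = "Invoice 2024-001 paid with 4111 1111 1111 1111, ref 4111 1111 1111 1112."
flagged = find_card_numbers(sample)
```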

Transcript

Cory Johnson:
We’re joined right now by Mark Molyneux, he’s the CTO of an interesting company called Cohesity. It’s facing one of the most interesting problems in our society right now, and a big one: these ideas about sustainability, data centers, and the ecological pain that AI could give us. Mark, so glad to have you on. First of all, tell us about what Cohesity is.

Mark Molyneux:
Yeah. So Cohesity is a leading company in data security, data management, and AI. We’ve got a mission to protect the world’s data and I think we’re doing a very good job of it.

Cory Johnson:
Well, I certainly hope so. It’s a big job. But the first thing that comes to mind when we talk about protecting data isn’t sustainability, our environment, the uses of water and power, and climate change. Explain to me why this is such a concern in general for data centers and then let’s go on and talk a little bit more about how AI changes that.

Mark Molyneux:
Yeah. So I think the challenge around sustainability is that most companies today don’t actually figure sustainability as a big picture item within their vision and strategy specifically for data. So they have it in their company agenda, they have it in their goals to do things like reduce energy, turn off lights, et cetera. And some go a stage further and actually start properly looking at Scope 3 emissions and looking at downstream providers and transport and things like that. What they don’t do is look at the data. So what we’re seeing is we’re seeing data grow exponentially, by 50% every year, and that data content isn’t categorized.

People don’t necessarily understand what that data is and why they’re keeping it and it grows and grows and grows. And because they don’t understand much about that data, they don’t do anything with the data. So if you think about that in data center terms, that’s filling up storage, that’s filling up media, that’s filling up floor space. It requires electricity to run the technology. It requires water to cool the technology. All the technology produces CO2, which obviously contributes to greenhouse gases, which is one of the things that the company, as a bigger target, is trying to reduce. So what we’re seeing is we’re seeing data growing exponentially with very little care taken about it and that directly impacting what a company’s trying to do for its sustainability goals.

Cory Johnson:
But isn’t the promise of data that the data has value and you don’t know which data has value, or that AI will give you the ability to access and extract value from data that you didn’t know was valuable? Old credit card receipts from customers that you may have thought you lost, now you’ve realized there’s something in the patterns that may tell you something about your business. And the best thing to do, we were told, is keep that data around, you might need it later.

Mark Molyneux:
Absolutely. Yeah. So that’s the key promise behind keeping the data. Everybody keeps the data because it’s this big ticket item that they can then get insights from.

Cory Johnson:
In theory?

Mark Molyneux:
Yeah, in theory. But it’s also, in fact, that data is valuable. It can provide those insights that you need it to provide across a variety of use cases. However, AI isn’t a magic wand. It doesn’t just work, you have to train it. And unless you know what the data is that you are putting in, the language models that AI uses are only as good as the data that goes into them. You have to train them effectively, otherwise they hallucinate. They don’t bring back the correct results so it’s largely useless. So to be able to make AI useful, you have to understand what the data is. And if you don’t understand what the data is, you’ve got this circle now of these things just won’t work because you don’t understand what that data content is.

So we all come back to understanding what the data is. Now, AI itself drives exponential use of technology. So as I said a moment ago, technology gets more and more intensely used as data goes onto it. If you look at it from an AI perspective, your average server rack is probably using eight kilowatts. Well, AI is 30 kilowatts. So it’s a huge difference when you’re starting to push AI through that. We also know from research that your average Google search is about 0.3 watt-hours, but if you are doing an interaction with a large language model, Alphabet’s chairman said it was going to be 10 times the value. So you’re now talking three watt-hours per large language search, and ChatGPT’s responding to 195 million of these a day. So now you start to see 560 plus megawatt-hours of electricity a day are now being consumed by AI. So AI has a dramatic effect on what you do with sustainability, but it still all comes back to the point that we were mentioning a moment ago around data. If you don’t understand what that data is, how useful and valuable can that data be to you? And this is partly where we’re advocating as Cohesity that you need to understand far more about your data. We’re coming to the party.
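The arithmetic behind these figures can be reproduced in a few lines. The sketch below assumes the per-query numbers are energy values in watt-hours and the rack numbers are power draw in kilowatts, which is the reading under which the quoted total works out; the variable names are illustrative.

```python
SEARCH_WH = 0.3                 # energy per conventional search, as quoted (assumed Wh)
LLM_MULTIPLIER = 10             # "10 times the value", as quoted
QUERIES_PER_DAY = 195_000_000   # daily large language model queries, as quoted

llm_wh_per_query = SEARCH_WH * LLM_MULTIPLIER                 # about 3 Wh per query
daily_mwh = llm_wh_per_query * QUERIES_PER_DAY / 1_000_000    # Wh -> MWh
average_power_mw = daily_mwh / 24                             # continuous draw in MW

STANDARD_RACK_KW = 8            # average server rack, as quoted
AI_RACK_KW = 30                 # AI rack, as quoted
standard_rack_kwh_per_day = STANDARD_RACK_KW * 24
ai_rack_kwh_per_day = AI_RACK_KW * 24

print(f"LLM queries: {daily_mwh:.0f} MWh per day")
print(f"Rack energy per day: {standard_rack_kwh_per_day} kWh vs {ai_rack_kwh_per_day} kWh")
```

With the rounded figure of three watt-hours per query the product comes to 585 MWh per day, in line with the "560 plus" quoted in the interview.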

Cory Johnson:
It seems that you’re also arguing, you got to get rid of data. If you don’t know what it is, toss it.

Mark Molyneux:
Correct.

Cory Johnson:
If you haven’t used it lately, toss it.

Mark Molyneux:
That’s exactly what it is. Yeah. I mean the term that we use for it is defensible deletion. So can you make a defensible deletion decision against that data? So you are probably holding Cory’s shopping list from 10 years ago, maybe pictures of your dog, videos of an old conference call that you did that you needed to keep for a few weeks. There’s no intelligence around how much data is kept on an individual basis and then if you expand that out to a department or to a company, it’s huge amounts of data that’s kept hand over fist that isn’t needed. Now every company has a relevant record strategy, so a strategy where a record for a particular purpose is kept for a particular length of time. What we don’t see, especially in unstructured data, is anyone categorizing the data. So classification and indexing of data is absolutely critical.

If you classify your data, first of all, if you index it so you know where it is, if you classify that data and then say, well, I know that that’s Cory’s shopping list, but I also know that this is a mortgage record or this is dental records, healthcare, this is a wing blueprint from an airplane that was made 10 years ago. These are all relevant records that have to be kept for a period of time. And in some cases they have to be kept immutable for that set period of time before they can be deleted. But you can start bucketing up in this record strategy to create this defensible deletion to be able to say, I can make a decision about that data and not keep it. And this is where you start to reduce that volume. This is where you start going on a data diet. This is where you start reducing the volumes of data, which then contribute to what you are doing with sustainability directly.
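The link between classification, retention periods, and a defensible deletion decision can be sketched as follows. The record classes, retention periods, and names here are hypothetical placeholders chosen for illustration; real periods come from an organization's own relevant record policy and its regulators.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical retention schedule in days (illustrative values only).
RETENTION_DAYS = {
    "mortgage_record": 365 * 10,
    "dental_record": 365 * 10,
    "shopping_list": 0,
}


@dataclass
class IndexedItem:
    path: str
    record_class: Optional[str]   # None means the item has not been classified yet
    created: date


def decide(item: IndexedItem, today: date) -> str:
    """Return 'classify', 'retain', or 'delete' for one indexed item."""
    if item.record_class not in RETENTION_DAYS:
        return "classify"         # unknown data: no defensible decision is possible yet
    expires = item.created + timedelta(days=RETENTION_DAYS[item.record_class])
    return "delete" if today >= expires else "retain"


today = date(2024, 1, 1)
items = [
    IndexedItem("/share/list.txt", "shopping_list", date(2014, 1, 1)),
    IndexedItem("/loans/m-001.pdf", "mortgage_record", date(2020, 1, 1)),
    IndexedItem("/misc/scan0042.bin", None, date(2016, 5, 5)),
]
decisions = {item.path: decide(item, today) for item in items}
```

Unclassified items are never deleted by this rule; they are sent back for classification first, which is what makes the eventual deletion defensible.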

Cory Johnson:
It’s a fascinating conversation. In preparing for our interview today, I could not help but think of one of my ex-wife’s stories. And I don’t like to tell these in public but I’ll share one with you. She saw an episode of Oprah that said, “If you haven’t used something in three years, throw it away”. And then went through and threw away all of our old tax records and bank records because she had never used them. And so her decision facing that pile of data was, this is useless data. My position when I returned home and found out that the garbage man had taken away the many years of tax and banking data was different, we’ll just say to keep my language out of the four letter word zone, my take on the data that someone else decided to toss was very different. How does an organization deal with that? Understanding they’ve got sustainability and climate goals. They want to limit how much data they’re paying for to store, but recognizing that somebody else might have a different view of the value of that data.

Mark Molyneux:
That’s it exactly. I mean, that’s where the relevant record strategy comes in every company –

Cory Johnson:
And I’m not asking for marital advice.

Mark Molyneux:
No, no.

Cory Johnson:
But it kind of is.

Mark Molyneux:
I’ve been married for 30 years, so I wouldn’t even dream of giving it. Yeah. Every company should have a relevant record strategy, and if they don’t, they should be creating one. And this applies to every record within their business that has a material value. So every business unit will know what’s valuable to them. And in some cases, depending on the industry you’re in, it’s prescribed. If you’re in financial services, you know you are keeping those loan records and tax records and mortgage records, et cetera for set periods of time. That’s your relevant record strategy. What I tend to see when I’m talking to customers is there’s no correlation between the relevant record strategy at the top of the company that most employees do training for and attest to every year to what actually happens down at the bottom of the company in a backup strategy. What tends to happen is the backup group are told to back up the servers or back up this storage array or back up this piece of data.

There’s no correlation between the backup policies and the relevant record strategy. And if there was, they would know exactly where all that data was within backup. They would know where it was within storage because it would be clearly marked, it would be classified and it would be indexed. Now, you can materially get value from that immediately through reporting anyway. Cohesity’s product, for example, reports on data content. It reports on ownership, last access, size, type of data, and also elements within the data, PII data for example. So you can already go a stage further with backup and recovery if you like. If you then put an AI layer over that, because you’ve classified and indexed that data, you can now use AI to train your large language models or use something like retrieval-augmented generation to actually create a bucket of that data that you can then query with natural language. So you can go in there and say, “Hey, I want Cory’s tax records for the year 2018”, and it’ll come back with everything. Or you can ask about that time you underpaid your tax between month one and month six, and it’ll come back with that. So you’ve enabled that data mountain to start materially giving you really strong insights into what you’ve got on the shop floor. Then you can make a defensible decision against what you do with that data.

Cory Johnson:
And there are also legal ramifications about what data is kept and what isn’t. I think the financial services industry is very strong about this only because they’ve gotten in so much trouble from eliminating messages, I’m thinking particularly about the messages about traders and things that have led to big lawsuits over the years. So they have some really rigorous policies about what can be destroyed and what cannot.

Mark Molyneux:
Absolutely. Yeah. I mean, tick data, all sorts of things are kept as part of financial services. I was lucky enough to be part of the 2008 financial crisis. So I was on a group where we were actually doing data retention, yeah lucky me, and we were in a retain-all mode because we really didn’t know what data we needed to keep at that time and how long we needed to keep it for. So we just kept everything because that was the path of least resistance. And in a company the size of a major financial institution, it’s a risk-weighted decision to say, well, okay, the risk of not having this data and being found out by the regulator is actually higher than the risk of keeping all this data and just spending money on it. Remember, back in 2008, sustainability wasn’t the big ticket item that it is today.

It was only after COP26 and the Paris Accord that we started to see companies worldwide signing up to their government strategy for sustainability goals and agenda. Now it’s a material difference. Now, you can’t just keep going out buying data centers, filling them with storage and filling them with data. You’ve got to be intelligent about the way you deal with that and how you classify and index that data and tag it for AI use because then you can get genuine insights from it. And as I was saying, it’s not just that the insights you can get from the data are hugely strong for your business, but, as you said, it can go back to the regulator. If you have a security breach, you have to report to the SEC within 72 hours.

Cory Johnson:
Well, I’m thinking of 2001 when there was an investment banker in Silicon Valley, I’m pointing to Silicon Valley behind me here. There’s an investment banker in Silicon Valley who upon being informed that the firm was under investigation for some potentially bad practices of paying kickbacks to executives to steer banking business their way, instantly told one of his colleagues to send out an email to everyone saying, “Remember our policy, delete all your old emails right now”. And that became a crux of the case.

Mark Molyneux:
Yeah. And that’s it. If you had a proper strategy in place that linked your relevant records to your backup policies, the guy there in that example could have gone and had everyone delete their emails and it wouldn’t have made a difference because there would’ve been a copy in an indexed fashion that could be queried. And in modern times now with AI, you can query it with natural language.

Cory Johnson:
Let me ask you finally, it seems that there’s always going to be an inherent conflict between the idea of storing up lots of data because it has value or potential value down the road and not storing data because the cost of doing so is becoming greater, and as you point out, 10x with AI. Is there a simple solution to figure out what do I do when I’m at that fork in the road? I want to keep it, I need to toss it. What do I do?

Mark Molyneux:
It’s a material decision, isn’t it? I think when you are at that point, you are going to understand whether you need to delete the data because it’s obsolete, whether you need to keep it for a particular purpose. I think if you follow this mantra of… I mean, you’ve got to start somewhere, so you might as well start now. If you can start moving forward and classifying and indexing the data going forward, you can make those decisions immediately because you know you are not going to keep my shopping list for the next 10 years. You’re going to keep it for five minutes and then get rid of it. So that could be a personal choice, it could be an enforced choice. But companies can put this enforced data drop within their policy so the data doesn’t exist after that period of time and you start to manage that mountain.

So if you know you are growing at 50% a year in your data, but you also know that probably only 30-40% of that whole data is actually of any value, why wouldn’t you start making those decisions now? Why wouldn’t you start driving those numbers now going forward and then kick off something in parallel to go back and look at that other data to make those decisions against it? Because it’s all going to go towards your sustainability agenda. As we said before, if AI is going to drive 30 kilowatts per rack, you need to start turning some of these racks off, and reducing data is the answer to that.
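A rough projection shows how much difference that forward-looking decision makes. The sketch below compounds 50% yearly growth and, as a simplifying assumption, applies a keep share of 40% (the upper end of the range mentioned) to each year's new data only; the starting volume of 100 TB and the function name are made up for illustration.

```python
def project(start_tb: float, years: int, growth: float = 0.5,
            keep_share: float = 1.0) -> list[float]:
    """Yearly volumes when only keep_share of each year's new data is retained."""
    volumes = [start_tb]
    for _ in range(years):
        new_data = volumes[-1] * growth            # data created this year
        volumes.append(volumes[-1] + new_data * keep_share)
    return volumes


unmanaged = project(100, 5)                  # keep everything
managed = project(100, 5, keep_share=0.4)    # keep only the data judged valuable

print(f"After 5 years, unmanaged: {unmanaged[-1]:.1f} TB, managed: {managed[-1]:.1f} TB")
```

Under these assumptions the unmanaged estate grows to roughly 7.6 times its starting size in five years, while the managed one grows to roughly 2.5 times.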

Cory Johnson:
All right. He’s the Marie Kondo of data. Mark Molyneux of Cohesity. Thank you very much, we appreciate your time.

Mark Molyneux:
Thanks very much.
