The New Tier 0 from Hammerspace

✨What’s the top speed “holy grail” in the tiered storage hierarchy? Tier 0! Host David Nicholson is joined by Hammerspace’s Molly Presley, Senior Vice President of Global Marketing, to discuss how Hammerspace is unlocking the power of unused NVMe storage in existing GPU servers, creating a massive, high-performance data pool for AI workloads.

Their discussion covers:

– The concept and technology behind The New Tier 0

– Hammerspace’s approach to addressing modern data challenges

– Insights into the future roadmap for Hammerspace and its technologies

– How The New Tier 0 translates to faster checkpoints, increased GPU utilization, and significant cost savings

– Real-world applications and benefits for organizations adopting The New Tier 0

Learn more at Hammerspace.

Watch the video below at Six Five Media at SC24 and be sure to subscribe to our YouTube channel, so you never miss an episode.

Transcript

David Nicholson: Welcome to the Supercomputing Conference, SC24 in Atlanta. I’m Dave Nicholson with Six Five On the Road, and I’m here at Hammerspace with the one, the only Molly Presley from Hammerspace. Good to see you, Molly.

Molly Presley: Awesome to see you too.

David Nicholson: I want to hear what’s the latest, what’s here? I hear something about Tier 0 going on at Hammerspace. Explain to me what that’s all about.

Molly Presley: Yeah, so we just announced last week a new tier of storage that has never existed anywhere before. It’s an industry first that we’re very excited about, and our marketing campaign around it is: you may be sitting on a gold mine and not even know it. And what we mean by that is all these DGX servers from Nvidia, big compute nodes that are shipping, the vast majority of them have some solid-state storage in them. And that solid-state storage is largely going unused for big HPC workflows for AI training, because it’s not really designed to be used as shared storage. So customers like Los Alamos, the big hyperscalers that are all doing these big AI environments, have all this storage that is available to them now because of the Tier 0 capabilities that Hammerspace has released.

David Nicholson: I mean, what kind of quantities are we talking about here? Is it a significant amount of storage that you can actually get-

Molly Presley: Yeah, 20, 30 petabytes at larger scale, a lot. So you figure what you would spend on 30 petabytes of fast Tier 1 Lustre or whatever kind of storage, and you may already own it and be able to just turn it on with a Hammerspace license.

David Nicholson: Well, what are you going to store in that Tier 0 space? It’s super high performance, but what are you going to use it for?

Molly Presley: Yeah, so it’s super high performance. And the reason people haven’t used it so far is because they couldn’t make it into a shared storage environment. So if they have a hundred nodes with a hundred different data silos, how do you train an LLM with that? Or how do your data engineers get access to it and know what it is? So we had to break down and unify that into a shared storage environment, and take advantage of it being fast and also protected. Because if it’s not protected, people aren’t going to put all this data that they’re spending all this money computing on drives that might fail. And compute nodes do fail. They’re hot, this happens, so we need to protect it. But what they’re using it for: checkpoints, to make checkpointing much faster. They’re also using it to write the actual data and then orchestrate it off to their Tier 1 or external environments as it starts to fill up. So it’s really the primary landing place for data and checkpoints.

David Nicholson: Okay. So when you talk about checkpoints, speed is really important when you’re recovering. Is that what we’re saying?

Molly Presley: Well, really, yes, but it’s more about not having the GPU sitting idle during the checkpoint process. So think of a typical compute environment. Maybe they checkpoint once an hour and that checkpoint takes five to 10 minutes. Those GPUs are idle that entire time. So you figure five or 10 minutes out of every 60, you’re talking eight, 10, 12% more GPU time. So all those GPUs can be used more effectively and run much faster or more consistently. And then of course, if you have a recovery, that recovery is fast. But first and foremost, most of these environments are thinking about how do I make my GPUs as productive as possible, and then how do I protect and use that data?
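The arithmetic here is easy to check: the idle fraction is simply checkpoint duration over checkpoint interval. A minimal sketch, using the illustrative figures from the conversation rather than measured values:

```python
# Back-of-the-envelope estimate of GPU time lost to synchronous checkpointing.
# Figures are the illustrative ones from the conversation, not measurements.

def idle_fraction(checkpoint_minutes: float, interval_minutes: float) -> float:
    """Fraction of wall-clock time GPUs sit idle waiting on checkpoints."""
    return checkpoint_minutes / interval_minutes

# Checkpointing once an hour, with each checkpoint taking 5 to 10 minutes:
low = idle_fraction(5, 60)    # roughly 8% of GPU time idle
high = idle_fraction(10, 60)  # roughly 17% of GPU time idle
print(f"GPU time recoverable with fast checkpoints: {low:.0%} to {high:.0%}")
```

The point of a fast Tier 0 is to shrink the checkpoint itself rather than checkpoint less often, which recovers that idle time without widening the recovery window.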

David Nicholson: You’re recovering stranded resources, which sounds like 100% goodness. Is there a potential downside to that? Should people be concerned about churning away GPU cycles on using this storage?

Molly Presley: Oh, no, definitely not. So the downside and the reason, we’ve talked to a lot of customers, ones that are designing the really well-known AI environments with tens or hundreds of thousands of GPUs, about why they’re not using this today. And it really came down to, “I can’t have silos of data and I can’t have unreliable disks.” So when you think about writing your primary data set to storage, they want it protected with erasure coding or mirroring or something. And those SSDs without the Hammerspace Tier 0 aren’t protected. They’re just a bunch of JBODs, and so they can’t risk losing the data. So those are the two reasons they haven’t done it. But really, there isn’t any downside. They’re already configured, all that time that’s been spent setting up those GPU environments, configuring them, getting them up and running. It’s already installed and configured, the burn-in’s done, and the infant mortality has fallen out. They’re all sitting there available. They’re already powered on, they’re already on the network. So there’s really not any downside to speak of. It’s not taking away from the GPU’s memory or something like that.

David Nicholson: So under the heading of overall data orchestration, are you setting up a single large pool with these devices? Do you have options for how that’s provisioned? What does that look like in that example of a 1000 of these things aggregated together with Hammerspace? What does that look like?

Molly Presley: Yeah, so essentially Hammerspace creates a parallel file system, but it’s a parallel global file system. Each of these nodes is just a member of that global file system. So as data is created on, let’s say, a hundred DGX nodes, all of that data is instantly part of the file system, because the metadata is assimilated into it. And so those are just data sources that are aggregated into one file system and one data set. And so instantly the application, the user, the model, whatever it is that’s looking at the data, can see all of the data across those nodes as one single data set. And then if the data needs to be moved, because the SSDs are getting full or because you want to move it up into the cloud for some other processing or other models, the Hammerspace data orchestration policies take over and say, “Okay, I’ve met this objective, my SSDs are full,” or, “This data set is designed to move to the cloud.” It automatically starts that move. But the applications and users are just looking at the file system, and they don’t know the data’s moving around. They can see the data and know what’s there no matter which node it was created on or where it might be getting moved to.

David Nicholson: You have this spooky thing, Molly, where you read my mind, because I was just going to ask about this.

Molly Presley: We’re connected.

David Nicholson: We are. Because whenever you talk about tiering, people immediately think about moving between tiers and when that might happen. You answered the question about moving to cloud. Is it still relevant today, this idea of moving hot data to higher-performance media and cold data to lower-performance media? Is that still something that people do, or would this Tier 0 layer tend to have certain things pinned in it, for lack of a better term?

Molly Presley: That’s a really good question. And what we have found, and I came from the days, I used to work for a tape library company, so I know a lot about archive: data was created and then moved into the archive essentially as it aged. And that’s not really how data is being used, especially in these AI environments where you’re constantly searching for which data might have interesting information. If I correlate this data to that data, what might I find? And so you never really know which data is going to be relevant to an AI job, so just tiering it based on age doesn’t really work. How these environments are working with the more modern data systems is: all of the data is in a single metadata environment, no matter where it sits. You’re isolating the metadata. So the scientists, the models can access and see all the data, and whether it’s sitting in object or tape or Tier 0, it doesn’t really matter, because they can see all the data, and then when the time comes, they say, “Okay, I’m going to run a job.”

That data can be moved automatically to faster storage or proximity to an application in the cloud, whatever that might be. But without disrupting the namespace, everyone can still see the data even as it’s moving around. So yeah, some people may pin data to a specific GPU. They may let all of the data sit in maybe Tier 1 or Tier 2 after it’s created and use the Tier 0 when it’s being processed. There’s a lot of different ways a customer could use the environment, depending on if they’re running active data sets, if they’re really trying to figure out what data they have and then pin the data that’s relevant in a project. But the cool thing about this is that’s all automated and done by software. So you can set objectives and have the data behave automatically the way you want. It’s not a bunch of IT guys with tickets open trying to copy data. This is automated data orchestration.
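The objective-driven behavior described above can be sketched conceptually in a few lines. Everything here, the class names, the fill threshold, and the selection rule, is a hypothetical illustration of the idea, not Hammerspace’s actual policy engine or API:

```python
from dataclasses import dataclass

@dataclass
class Volume:
    """A storage target in the global namespace (hypothetical model)."""
    name: str
    tier: int            # 0 = node-local NVMe, higher = slower tiers
    used_bytes: int
    capacity_bytes: int

    @property
    def fill_ratio(self) -> float:
        return self.used_bytes / self.capacity_bytes

def select_migrations(volumes, fill_threshold=0.85):
    """When a Tier 0 volume crosses the fill threshold, pick the emptiest
    lower tier as the destination. Returns (source, destination) pairs.
    Applications keep seeing one namespace; only placement changes."""
    moves = []
    for v in volumes:
        if v.tier == 0 and v.fill_ratio >= fill_threshold:
            candidates = [d for d in volumes if d.tier > 0]
            if candidates:
                dest = min(candidates, key=lambda d: d.fill_ratio)
                moves.append((v.name, dest.name))
    return moves

# A nearly full node-local volume triggers a move to the emptier Tier 1 target:
hot = Volume("dgx-01-nvme", 0, 90, 100)
cold = Volume("tier1-lustre", 1, 40, 100)
print(select_migrations([hot, cold]))
```

The design point being illustrated: placement decisions key off declared objectives (fill level, destination policy) rather than manual tickets, which is what makes the orchestration automatic.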

David Nicholson: So I’ve recently seen talk from some component vendors of NVMe devices, 60, 80, 100 terabytes-

Molly Presley: Right, huge.

David Nicholson: … per device. But back to the top of what we were talking about, what is the average size of those stranded NVMe devices that you’re seeing?

Molly Presley: Yeah, usually right now we’re seeing about 30 terabytes per drive, and there’s eight of them. So 240 terabytes is a normal deployment today. And then you figure, okay-

David Nicholson: It’s completely crazy.

Molly Presley: It’s crazy.

David Nicholson: It’s completely crazy.

Molly Presley: And you don’t have to be talking about hundreds or thousands of GPU nodes for this to matter. I mean, even think of just 10 GPU nodes, which is a traditional enterprise environment. That’s still, you’re getting into the petabytes of data.
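Those capacity figures multiply out directly. A rough sketch using the numbers from the conversation, ignoring any capacity reserved for erasure coding or mirroring:

```python
# Aggregate NVMe capacity stranded in GPU nodes, using the figures
# from the conversation (8 drives per node at ~30 TB each).
TB_PER_DRIVE = 30
DRIVES_PER_NODE = 8

def aggregate_tb(nodes: int) -> int:
    """Raw terabytes of node-local NVMe across the cluster."""
    return nodes * DRIVES_PER_NODE * TB_PER_DRIVE

print(aggregate_tb(1))    # 240 TB per node
print(aggregate_tb(10))   # 2,400 TB, i.e. 2.4 PB for a modest 10-node cluster
print(aggregate_tb(100))  # 24,000 TB, i.e. 24 PB at hundred-node scale
```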

David Nicholson: So the concept of recovering stranded stuff in the past might’ve looked like: you have a nine-gig boot drive, here’s a gig to use. But at the performance levels and the capacities that you’re talking about, these become massively meaningful resources. When people are spending a lot of money on GPUs, you’re essentially helping them fund their AI cluster, if they leverage this. Is that fair to say?

Molly Presley: No, that’s absolutely true. So one of our Tier 0 customers who’s in production today had exactly that issue. They had already funded their AI cluster, but they were out of power. And so they had the issue of: we’re out of capacity for storage, we’re out of power for our data center, but we need to keep doing AI. What do we do? Well, turn on Hammerspace, and now all of a sudden you have 30 more petabytes of storage with no additional power and no external networking; all the infrastructure you need is already there. And so this just freed up the ability to keep doing more work without having to deploy storage, which they didn’t have power for. So it’s interesting that it’s not just money. Maybe you’re out of power, maybe you need cash, and maybe it’s just: my GPUs aren’t as efficient as they should be, and another 15% on my GPU cycles would keep me from having to buy more GPUs.

David Nicholson: So I come from this, I come from an old knuckle-dragging storage background, and-

Molly Presley: Did we all do that?

David Nicholson: You could see the only place I don’t, yeah. And often platform upgrades were a huge issue, especially when you talk about the quantity of data we’re migrating now. We are hearing about very large data centers accelerating their refresh rates to get to latest-gen CPUs to free up power so they can buy more GPUs, because to your point, finite power, what do you do?

Molly Presley: Right.

David Nicholson: Just remind-

Molly Presley: Build a nuclear power plant next door if you can.

David Nicholson: No, that’s exactly what’s happening. It’s true. It’s true. And a lot of people are going to pretend like they were never against nuclear power. It’s going to be priceless to watch. But just remind people, when you’re using Hammerspace for a global file system and you’re doing things in parallel, is it fair to say that retiring a node that happens to include captive storage becomes a lot more trivial? It’s a lot easier to do.

Molly Presley: Sure. So when you retire a storage system or a node from the Hammerspace platform, it’s completely transparent to the applications and users. So that time you may remember, where you had to do it on Saturday night and notify everyone of application downtime and plan on it, and everyone was mad at you because they couldn’t get to their system. That’s completely gone, because the applications and the users are interacting with the file system, and if a storage system goes offline, they don’t see it, because they’re working with the metadata. And so if you decide as an IT team, I’m going to end-of-life this system, you can copy that data as you wish. Over time, the users and applications see the data while it’s in flight, so there is no downtime.

So even if it takes a month for the data to copy across a really slow network, it doesn’t matter, because the business isn’t interrupted. It really does decouple the application and user experience from the IT planning model. So if you’re going to the cloud or you’re coming off the cloud, you don’t have to repoint applications and users to the new instance in the data center or in the cloud. You just keep them on Hammerspace, and they’re continually running as IT moves things around.

David Nicholson: Yeah. 20 years ago we had a name for that. I think it was called a fantasy.

Molly Presley: Magic.

David Nicholson: Magic, exactly. We are here at SC24. This conference has grown over the decades to be absolutely amazing. Of course, AI’s front and center here.

Molly Presley: It’s great to see how big it is this year.

David Nicholson: You talk about high-performance computing, supercomputing, the requirements of AI, freeing up this new resource that is Tier 0 in a pool. That’s a pretty big headline.

Molly Presley: Yeah.

David Nicholson: Do you dare throw shade on your friends at Hammerspace and mention anything else new that you’re doing? Or do we want to just stay firm on Tier 0, other cool stuff?

Molly Presley: Tier 0 is the big news for sure, because it’s the first time anyone in the industry has been able to tackle this, and it’s so needed. But we absolutely have new advancements in our object storage capabilities. We have new advancements in the speed our metadata transacts, because you have to always be increasing the speeds of your system. Being able to really interlock AI workloads, the S3 and object data that’s coming in, and then process it on GPUs, and having that be really transparent so you don’t have separate object and file workflows, is important. And it all ties together: you want your GPUs to be efficient, you want them to access as much data as possible, and if that data has been ingested over S3 or file, it should be transparent. So those are the other advancements we have, but none of them are as industry-changing as the Tier 0 announcement.

David Nicholson: Yeah. Yeah. Well, Molly Presley, it’s always great to see you. I have to say the announcement of the week, and it is day one or two of the conference right now, the announcement of the week has got to be that Hammerspace is giving away free NVMe storage basically because you already paid for it, right?

Molly Presley: Yeah, absolutely. That’s right.

David Nicholson: Okay. Good.

Molly Presley: You already own it. You may as well turn it on and use it.

David Nicholson: Okay. Personally signing the contracts for that free storage, Molly Presley. Molly, thanks again. It’s so great to see you.

Molly Presley: Happy to take a PO for a bit of software.

David Nicholson: For Six Five On the Road, I’m Dave Nicholson. Thanks for joining us.
