The Vital Role of Data Orchestration in AI and GPU Workloads
There has been a lot of focus in the industry on how to deliver the performance GPUs need as infrastructure teams embark on LLM training, GenAI, and enterprise HPC projects. As those projects expand, data orchestration becomes crucial for moving decentralized data efficiently and on time to available GPUs that are likely not local to the data. By automating and managing data flow, data orchestration minimizes latency and maximizes GPU utilization, enabling faster processing of large datasets and complex algorithms.
We will discuss how existing AI infrastructure teams are leveraging data orchestration to gather larger data sets and improve their AI results.
Transcript
Dave Nicholson:
Welcome back to our continuing coverage of infrastructure in the cloud here at the Six Five Media Summit. I’m delighted to have Molly Presley, the one and only Molly Presley, join me from Hammerspace where she serves as Chief Marketing Officer. Molly, welcome.
Molly Presley:
It’s great to be here and great to be with this audience for our first time.
Dave Nicholson:
Yeah, it’s great to finally meet you, in fact. Tell us about Hammerspace. For those who aren’t familiar with Hammerspace, what’s Hammerspace all about?
Molly Presley:
Yeah, absolutely. So there’s a fun, memorable way to think about a hammerspace. If you’ve ever watched cartoons, the hammerspace is the place in comics where something big comes out of a small spot: Bugs Bunny pulling a massive mallet out of his pocket even though Bugs Bunny is quite small. That’s Mary Poppins’ purse. You could think of a lot of different analogies, and in the latest Spidey-Verse they actually talked about the hammerspace specifically for Spidey.
And I bring up that origin story of our name because it’s very analogous to what we do as a company. We want to use data in a lot of locations, and historically we’ve felt really constrained by physics: how do you move large data sets? Is my networking big enough? With Hammerspace, as a corporation and a technology company, you can bring those really big data sets out even if the space you’re bringing them through is very constricted. So what we offer is a global data platform that organizations are using for big data processing and AI types of initiatives.
Dave Nicholson:
One of the better origin stories for a company name, I have to admit.
Molly Presley:
So fitting. I wish I could claim I came up with it, but it was actually one of our engineers.
Dave Nicholson:
It’s very, very clever and apropos. Once you kind of have a feel for what Hammerspace is all about, you’re going to hear people start talking about orchestration, data orchestration. Talk a little bit about that. What does Hammerspace mean when they talk about data orchestration?
Molly Presley:
You bet. If you’re an infrastructure person, this will make a lot of sense to you, and even if you’re not, I think the concepts are pretty straightforward. If you think back 30 or 40 years ago, as IBM and the big mainframe companies were building storage systems, everything was built into the hardware platform. Everything was subservient to the platform of hardware it was brought into. And sure, we’ve evolved to scale-out systems and hybrid cloud systems, but the infrastructure has still really dictated how data can be used. Take that a step deeper and it’s because the file systems are embedded in the infrastructure. So if you want a high-performance, high-throughput system connected to your GPUs, historically people would’ve thought, “Okay, I need to go buy a high-performance storage cluster with certain attributes.” And as the world evolved and hybrid cloud came about, that led to, “Okay, which is my cloud instance, which is my cloud region, which is my storage cluster,” and they were all isolated from each other but served different purposes.
What Hammerspace has done is bring in a global file system that manages any infrastructure, so any cloud region, multi-cloud, any data center cluster, and provides the performance of something like a Lustre file system in the HPC space. But it also provides the magic part that doesn’t exist at all today: inside that global file system, we have automated data orchestration. What that is is software policies that say this data needs to exist in a certain place based on recent access, the project it’s associated with, or because the job scheduler or a GPU orchestration tool like Run.ai says, “I need this data set.” We gather it, put it local to the compute it needs to run on, and orchestrate it to that available compute.
And that might sound kind of trivial, but think about a machine-generated data world where you’re dealing with billions of files and you don’t know where they’re located, who owns them, or which project they’re associated with. We have all that really smart metadata to know which files exist, and then we intelligently move only the ones you need to the available compute, which may be in one location or multiple locations depending on where your compute is available. So it’s the whole idea of moving data to the available compute resources that you want to use.
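To make that concrete, here is a minimal sketch, in Python, of what metadata-first, policy-driven orchestration looks like in general terms. The FileRecord structure, the metadata catalog, and the stage_to_site helper are hypothetical illustrations for this discussion, not Hammerspace’s actual API.

```python
from dataclasses import dataclass

@dataclass
class FileRecord:
    path: str           # where the file currently lives
    project: str        # project tag from the metadata catalog
    last_access: float  # epoch seconds of last access
    site: str           # data center or cloud region currently holding it

def select_for_job(catalog, project, target_site):
    """Use metadata alone to decide which files a job actually needs moved.

    Only files tagged with the job's project that are NOT already at the
    compute site are candidates; everything else stays where it is.
    """
    return [f for f in catalog if f.project == project and f.site != target_site]

def orchestrate(catalog, project, target_site, stage_to_site):
    """Move only the needed files toward the available compute."""
    for record in select_for_job(catalog, project, target_site):
        stage_to_site(record, target_site)  # copy or replicate near the GPUs
        record.site = target_site           # catalog now reflects the new placement
```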
Dave Nicholson:
Yeah, I think anyone can relate to the idea of location being important. If you wake up at four o’clock in the morning and the outhouse is three miles away, not great. So it should be self-evident that having data close to the compute is important. But are you talking about data in a traditional on-premises environment, or in the cloud, or both? Does data orchestration cover both?
Molly Presley:
It’s really both. Yes. We run in any cloud, on any virtual machine; you can spin up an instance of Hammerspace quickly. What you find in most AI architectures today, and AI is on pretty much everyone’s mind, is that there’s a shortage of a couple of things: GPUs, power, and data. I think most organizations, even some of the big FAANG-type tech companies, say they have a hard time getting access to enough of their own data to train a large language model. So there’s GPU availability; Nvidia and companies like that need to sort that. Power companies will sort out power. What Hammerspace really helps with is getting access to the data, so knowing what it is, and then placing it where it needs to be processed. And so it’s a really big deal to be able to figure out, through intelligent metadata, which files exist and which you need.
And then being able to intelligently, efficiently move those files across potentially small network pipes, while locally giving high performance to the GPUs and applications using them even as files are being moved. There’s a lot of magic happening behind the scenes, and it’s really been a difficult problem to solve. That’s why there’s a big debate in the industry: is data gravity such that you must move your compute to your data? That’s kind of crazy if you think about it. The thing that’s digital is the data, not the infrastructure; that should be the easy thing to move. And it’s been solved in the consumer space. Think about your iPhone. I’m an Apple user, but I know this is true of other technologies, Samsung and whatnot. You take pictures on your cell phone, you want to see and edit them on your iPad, and on your computer you’re receiving text messages.
That’s all happening because iOS is a data orchestration tool for the consumer world. I don’t know where my files sit, I don’t know where my pictures sit; I know they’re in the cloud somewhere, but I don’t even know which one. But I do know I can access and see all of them anytime I want. Sometimes there’s a bit of latency as something is being pulled to a device, but it’s largely unnoticed. That’s really what we’re doing in the enterprise world: making that data viewable and accessible anywhere, from any device, compute, or application, versus it being dedicated to one iPhone or storage system and having to figure out how do I get a backup, how do I move my files over to something else. So it’s really following a trend we’ve become so dependent on in the consumer world.
Dave Nicholson:
Yeah, it’s interesting you use that as an example, the consumer device. On a daily basis, we don’t care where that data is as long as it’s accessible. Similarly, in the era of cloud, we have come to a point in time where it’s appropriate to say, “I don’t care what the infrastructure looks like on the backend. I want the service that will be delivered on top of that infrastructure.” Now we’re talking infrastructure here. Hammerspace is more of an ethereal infrastructure thing as opposed to the hardware that it’s orchestrating data upon. What does that hardware look like underneath? How much do you care about that, specifically in the AI space?
Molly Presley:
Yeah, it’s a really good question. So we’re software. We’re a global file system with a bunch of policy engines that move things around. We have two Linux kernel maintainers who work for us, so we try to drive as much of this as we can into the Linux kernel so that customers aren’t required to use proprietary drivers or figure out how to get systems up and running with us. There’s a lot we’re doing to make it very accessible within the organization. But when you look at the hardware, there’s a full spectrum. We don’t care. But hardware has physical performance limitations; an NVMe device is always going to be faster than a tape drive. That’s just reality. So when we look at it from our perspective, our customers typically have multiple different types of storage systems. They may have a seven-year-old Isilon, a brand-new VAST, some Google Cloud, some Amazon S3, and a Spectra tape library, because we were talking about Spectra a few minutes ago.
All that data and metadata is ingested into our file system, and you can view it all as a single entity. As far as which hardware a customer chooses, we certainly help them. We require some resources for metadata services and things like that, but then the rest is just the performance and cost attributes of the storage. So Facebook, Meta, I still sometimes make that mistake, Meta uses us for Llama 2 and Llama 3 training, and they prefer very specific hardware that’s very standards-based. As we all know, they’re very involved with OCP, and they have a big partnership with Pure. So we don’t care. There’s Pure, there’s OCP, there’s some other kind of white-box stuff sitting underneath us for Llama 2 and Llama 3 training.
But then you go look at maybe an enterprise that’s been up and running for a hundred years, and they have a bunch of old NetApps and IBMs. Those can also be presented, and we unify the data. And then maybe you have this old, let’s just say IBM box that’s been in production for 10 years. You still want the data, but you want some faster access. So when files do need to be faster, we’ll orchestrate them to something fast, but you don’t have to throw out everything you’ve already invested in.
Dave Nicholson:
Ah, yeah, you stole my thunder on the next question here because the…
Molly Presley:
That’s good. Exactly.
Dave Nicholson:
Well, exactly, exactly. It’s because you’re a pro, you’re a professional. So yeah, it’s one thing to say, “Oh, fantastic performance. We’re here in the future.” The folks charged with managing infrastructure on the backend are thinking, “Okay, yeah, what about all this legacy stuff that I have?” Do you help with migrations, essentially? And when I say migration, I mean the removal of old technology, the unscrewing of the incandescent light bulb and the screwing in of the LED light bulb, in a way that is completely transparent to the enterprise. You basically just said that yes, you do that, but I want to confirm that that’s what I heard you say.
Molly Presley:
We do. And if you’re like me, at least, I hadn’t heard the light bulb analogy before, but it’s a good one.
Dave Nicholson:
I just made it up.
Molly Presley:
You know, I’ve replaced most of the light bulbs in my house, but I still wait until the old ones die because I’m too cheap to just pull them out and throw them away while they’re still working. And that’s kind of the way enterprises run, too. They spend a bunch of money on infrastructure, and for economics, green initiatives, and just time, because doing a data migration is so time-consuming, they try to leave it in place as long as they can. What ends up happening is they start to fall behind the organizations doing greenfield new deployments, because they’re slower, less nimble, that type of thing. So on data migration, yes, we help a ton. You bring in Hammerspace, and again, we’re the active file system when you implement us; you connect the applications, the users, the GPUs to our file system over just standard NFS, SMB, and S3 connections. So we’re standards-based as far as how you connect to us. At that point we assimilate the metadata from that old infrastructure, and you can add in new infrastructure as well. But that’s all transparent to the users and applications.
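Because the connection path is standard protocols rather than a proprietary client, the client-side step can be as simple as an ordinary mount. A minimal sketch, assuming a hypothetical export name and mount point rather than anything from a real deployment:

```python
import subprocess

# Hypothetical export and mount point; a real deployment supplies its own values.
NFS_EXPORT = "hammerspace.example.com:/projects"
MOUNT_POINT = "/mnt/projects"

def mount_shared_namespace(export=NFS_EXPORT, mount_point=MOUNT_POINT):
    """Mount the shared namespace with the stock Linux NFS client (requires root).

    No vendor driver is involved: this is the same `mount -t nfs` you would run
    against any NFS server, which is the point of staying standards-based.
    """
    subprocess.run(["mount", "-t", "nfs", export, mount_point], check=True)
```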
So let’s just say you do have that old IBM and it’s finally petering out and going to die on you. That’s okay, because the data and the metadata are already in Hammerspace, and the files you want to migrate can be migrated over days, weeks, months, however long it takes. The users and applications will not be affected because they’re working with the metadata, not the files themselves, so the files can move to the new fast system or the cloud system without interrupting operations. The problem with data migration is a couple of things. Do you really want to plan downtime for an entire weekend or a week while you do a migration? That’s very disruptive, and it just takes a lot of time; it takes a lot of IT’s time manually copying things. That’s completely removed when Hammerspace comes into the environment. IT and infrastructure are decoupled from the user experience with the data because we become the data layer.
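A rough sketch of the metadata-indirection idea behind that: clients always resolve a file through its catalog entry, so the bytes can be copied to new storage in the background and the pointer flipped afterward. The MetadataEntry, read_file, and migrate names below are illustrative only, not Hammerspace’s implementation.

```python
import shutil

class MetadataEntry:
    """Catalog entry: applications see the logical name, never the physical location."""
    def __init__(self, logical_name, physical_path):
        self.logical_name = logical_name
        self.physical_path = physical_path  # current backing location

def read_file(entry):
    """Applications resolve through metadata, so they don't care where the bytes live."""
    with open(entry.physical_path, "rb") as f:
        return f.read()

def migrate(entry, new_path):
    """Copy the bytes to the new system first, then repoint the metadata.

    Reads keep working the whole time: before the repoint they hit the old
    location, after it they hit the new one, with no application downtime.
    """
    shutil.copy2(entry.physical_path, new_path)  # background copy to new storage
    entry.physical_path = new_path               # flip the pointer once the copy is done
```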
Dave Nicholson:
Molly, let’s talk a little more about performance and how Hammerspace can deliver the kind of performance that’s needed by these data hungry GPUs today.
Molly Presley:
Yeah, you bet. Certainly the first thing organizations think about is how to design their infrastructure: “Okay, I’ve bought a few hundred GPUs,” or if you’re Meta, tens of thousands of GPUs, “how am I going to keep those busy?” Having a high-performance file system, typically a parallel file system like you would historically use in the HPC world, is the first thing people think of. Hammerspace absolutely provides performance. Right now we’re streaming large language model training data at Meta to well north of 24,000 GPUs at a time. So the speed matters. There are a couple of companies kind of bubbling to the top, us being one of them, on how you keep that local performance rolling so your GPUs are never slowed down when you have the data. But the cool part we add to that is: what if you need access to remote data sets, to other data sets from other organizations, other locations? Our integrated orchestration is a key part of these more decentralized AI workloads.
Dave Nicholson:
So with the minute or so we have left, I’m going to hit you with something that may be more than a minute-long answer. So we’re going to see just how professional you are here, Molly. Hammerspace’s development has preceded the dawn of at least the current AI era that we’re in. How is it that the architecture has been able to stay fresh, and what kinds of changes have you had to make in order to make this truly appropriate for the age of GPUs and AI? Have there been any hurdles you’ve had to overcome?
Molly Presley:
Well, I would say there’s always a little bit of luck when your technology comes out, and the need for our technology in AI is massive. But as you think about what we’ve needed to do, the technology has been generally available for about two years and in development for 10, which is about how long it takes to build a file system. Our CEO was the founder of Fusion-io, which was the company that actually decentralized data to start with by putting those Fusion cards out in servers. And he’s had a vision toward this of, “Great, I forced this decentralization, but now how do you use the data?”
And we’re very closely partnered with Gary Grider at Los Alamos National Labs, who was kind of the grandfather of a lot of the HPC file systems, and we’ve been working in conjunction with them and the Linux community on how to solve this decentralized data problem. And it’s just been exacerbated by the fact that we now have a lot more remote workers after COVID, and now we have GPU shortages, so there’s more need for it than we originally expected. But it’s just kind of the perfect storm of why Hammerspace is needed, and the technology is showing up in the market at the right time, to be honest with you.
Dave Nicholson:
Yeah, very interesting. Serendipity can be a beautiful thing. But that history back through Fusion-io, and the indirect and direct ways the thought process is linked, really has set you up in this market very, very well. It’s very interesting. Molly Presley, thanks so much for joining us here at the Six Five Summit. Molly Presley from Hammerspace, thanks again. I’m Dave Nicholson. Stay tuned for more on infrastructure and other topics here at the Six Five Summit.