Perspectives on the AI Data Pipeline with Solidigm
On this episode of the Six Five Webcast – Infrastructure Matters, host Camberley Bates is joined by Solidigm’s Rita Wouhaybi, Ph.D., AI Fellow for Solidigm, for a conversation on the intricacies and future directions of AI data pipelines.
Their discussion covers:
- The evolution and current state of AI data management
- Challenges in building scalable AI data infrastructure
- Solidigm’s unique approach to addressing these challenges
- Future trends in AI data processing and analysis
- Insights into best practices for enterprises adopting AI technologies
Learn more at Solidigm.
Transcript
Camberley Bates: Hi everyone, I’m Camberley Bates. I am your host of Infrastructure Matters, Insider Edition, and I am very pleased to welcome Rita Wouhaybi to our show. She is a fellow at Solidigm, and we have got some great stuff to talk to you about: AI and data. Welcome, Rita.
Rita Wouhaybi: Thank you so much for having me. It’s great to be here.
Camberley Bates: Let me give you a little background on Rita. I'm not an engineer, as most of you who have seen me on here know, so whenever I'm with an incredibly intelligent engineer, especially a female engineer, I'm in awe. And I'm definitely in awe here. I've had the privilege to spend some time with her. She most recently came from Intel, where she was a principal engineer in the office of the CTO, doing really cool stuff that you're going to hear about. She had been working in the manufacturing segments and some of the networking work before she joined as an AI Fellow for Solidigm. She's built some great things having to do with Edge and AI models and a couple of different areas. She's got a PhD in electrical engineering from Columbia. She's got something like 400 patents and 20 published papers, oh my gosh. It's quite the luminary that I'm sitting here with. Thank you.
Rita Wouhaybi: Thank you. I don’t know, I think I’m shrinking right now.
Camberley Bates: That's okay. You deserve it all. So what I told the team here is that we're going to talk about AI and data. For those of you who don't know, Solidigm is in the business of data, in the business of flash, the things that all this stuff is stored on. And I know when we first started talking about this, you were very involved with Edge and AI services. So tell us about the cool stuff you've been doing. What have you been working on?
Rita Wouhaybi: Yeah, it has been an amazing ride, actually. I'll start with a little bit of background about myself. Years and years ago, and I'm dating myself right now, I got enamored with something that we used to call neural networks, and I did my master's in neural networks. But at that point in time, AI had over-promised and under-delivered. My love for it did not stop, though. Even though for a while I went on a trajectory of distributed systems and networking, I still strongly believed in creating a model, creating a program, if you will, a piece of software that can learn from behavior and patterns. But none of that prepared me, honestly, for the serendipity of the last few years and what I have done. When I moved to the Edge group at Intel and started working with customers, it was such an amazing experience, but also a humbling one, to figure out what it means to have all this data generated and try to extract value out of it.
We did a lot of work with large manufacturing names like Audi, and we were lucky that these companies actually trusted us with their data sets and asked us for help in finding solutions for some of their business problems. For Audi, the solution was very, very eye-opening. Every time I went to their factory, I would be in awe watching the robots. I was like that kid you see watching the washing machine spin who could do it all day long; that was me watching these magnificent robots they had programmed, and the expertise behind it, to build a car chassis from scratch. However, they were having problems in their welding process. They would apply welds to cars, some of those welds wouldn't take, and they couldn't inspect them in real time on the line. Their inspection was an offline process, very costly, and it could not scale; it literally involved a lot of engineers coming in with notebooks, writing notes, and testing one welding spot at a time.
So being able to bring AI to a leader in manufacturing like Audi was honestly a huge privilege. We were able to use the data set they shared with us to inspect everything online, and Audi scaled that solution worldwide and went public about it. That's why I'm able to mention them. They even gave me a pat on the shoulder on LinkedIn, we wrote some papers together, there were videos produced, and so on. But I learned a lot. I was very lucky because the learnings were amazing and opened my eyes to where else AI can have an impact and solve real problems for real people, whether it's assisting someone in their day-to-day job in a retail store, helping a health professional make faster and better diagnoses, or going into manufacturing to produce better products, protect the humans, and address all kinds of interesting business use cases.
Camberley Bates: Wow. And that's kind of what we're going to be talking about: how we take your incredible knowledge base, now that you're an AI Fellow at Solidigm, and talk about what is going on in terms of the preparation of data for AI requirements and how it's processed. Because yes, manufacturing is ultimately where we understand the inference engines and everything else live, but all this other stuff has to happen on the back side of it. So let's start there. I had the privilege of seeing you speak at a couple of conferences on this topic, which is why I wanted you on here. So let's talk about what I've been calling the pipeline of data: the gathering, the quantity, the data prep, et cetera. What are you looking at, or what are the problems that you see happening there?
Rita Wouhaybi: Yeah, it's a little bit ironic that I get paid by a storage company, but I'm looking at ways of making the data more manageable and perhaps reducing it at some point. Because honestly, it all starts with the data. And today, if you look at the AI community, there are two forks going in parallel, and both are unbelievably interesting. There are the LLMs and all these foundation models that are getting created with amazing value and have obviously created a lot of traction for solving some of those use cases. But one thing you have to keep in mind is that these foundation models, these AI models that are very large and are crunching amazing amounts of data, data that we would not have imagined only a decade ago, whole corpora of data, are not super sustainable from a power, compute, or even an operations perspective.
So when someone creates a model that is, whatever, 120 billion parameters and says, "Here you go, it's going to solve all those problems," that's great, but for some use cases, that's almost like me saying, "Hey Camberley, let's go on a hike, but I'm going to bring this truck to use." It doesn't make any sense. The trail is small. So there is a second, growing body of research from both academia and industry, and my team and I are very focused on it: instead of just dumping data brute force and saying, "Let me learn from these huge corpora of data," can we actually find automated ways, not data scientists who cost an arm and a leg sitting and sifting through the data and annotating it, but automated ways to know what data is valuable?
And I'm going to give you an example here that actually did hurt; it happened a few months ago, before I left Intel. We were working with a customer that, for obvious reasons, I can't name, and this customer had a warehousing application that we were helping them with. We were looking for defects. We had cameras, the cameras were watching trucks being unloaded, products coming off the trucks, and we were looking for defects in those products. We were doing focused sessions where they had people from their end and people from our end on call; we were collecting data and checking whether we were able to solve this problem. At some point the model was getting more and more stable and generalized, converging and doing all the good things. They are in a different geolocation, so they would work during our night here on the West Coast, and we would wake up in the morning, get some data, and see how the model performed and whether we needed to adjust.
And then one day we started looking at the data we were getting, the output of the model, and it was like, "Whoa, that's a lot of defects. What happened?" It was a crazy amount of defects; normally we were finding a handful of defects every day, which is how it should be in a good operation. It turns out they had finished that day early, and in the last shift, among the last few things they unloaded, there was one defective item. So they brought a dumpster to put that defective item in, plopped it in the dumpster, left the dumpster in front of the cameras, and went home.
Camberley Bates: Oh, geez.
Rita Wouhaybi: It's funny, but it also cost a pretty penny, because here's the thing: we were trying to do a lot of optimization. When we are streaming from multiple cameras, and these are high-resolution streams by the way, first of all we look to see if there is a product; if there is no product, we don't even activate the AI pipeline, because that's a lot of compute you would be wasting. So we were finding product constantly, and then our AI pipeline has multiple steps in it, and they were all getting activated because it was the worst-case scenario: it was an actual defective product, and we were flagging it every time. So what ended up happening is that we collected all this data. We were consuming power, consuming resources, running at peak on all the devices and all the cameras, we were storing all that data, and then they transferred all that data. So imagine the network bandwidth, the power, all of that got wasted. It all got transferred to the cloud, and then it all got transferred from the cloud back down to us to look at the data. And then, needless to say, the hours we spent sifting through the data. So… Go ahead.
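[For readers who want to picture the gating Rita describes, here is a minimal sketch of a two-stage pipeline in which a cheap presence check decides whether the expensive defect-detection steps run at all. The function names, threshold, and stand-in models are illustrative assumptions for this sketch, not the actual implementation from the project she mentions.]

```python
from typing import Callable, List, Optional

Frame = bytes  # stand-in for a decoded camera frame

def make_pipeline(
    presence_score: Callable[[Frame], float],      # cheap, lightweight classifier
    detect_defects: Callable[[Frame], List[str]],  # heavy, multi-step defect model
    presence_threshold: float = 0.6,
) -> Callable[[Frame], Optional[List[str]]]:
    """Build an inspector that only runs the heavy model when a product is in view."""
    def inspect(frame: Frame) -> Optional[List[str]]:
        # Cheap gate: skip the costly pipeline when nothing is in front of the camera
        if presence_score(frame) < presence_threshold:
            return None
        # Only now pay for the full defect-detection pass
        return detect_defects(frame)
    return inspect

if __name__ == "__main__":
    # Trivial stand-ins so the sketch runs end to end
    inspect = make_pipeline(
        presence_score=lambda f: 0.9 if f else 0.0,
        detect_defects=lambda f: ["dent"] if b"dent" in f else [],
    )
    print(inspect(b""))          # None: empty frame, the heavy model never runs
    print(inspect(b"box dent"))  # ['dent']: gate passes, the full pipeline runs
```

[The dumpster incident amounts to the gate always passing, so the expensive path runs on every frame, which is exactly the compute, storage, and network waste described above.]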
Camberley Bates: Well, what I was hearing is, okay, this is how AI and what we're doing is different from HPC, because we're bringing AI right into production; it's like a transactional system. So all the principles that we have for data management apply here in terms of speed, performance, data protection, something called rollback, getting rid of bad data, cleansing. It's just amazing, when we talk about this, how different it is. I've been part of the supercomputing world for a very long time, and people ask, "Well, how is this different?" It's because it's right in the middle of our systems and what we're doing, which goes back to some of the stuff that you guys are working on.
Rita Wouhaybi: Exactly, exactly. At the end of the day, the engineer in me who strives for efficiency hates to see those inefficiencies exist everywhere. And honestly, they don't just exist at the Edge; the inefficiencies of redundant data and data that looks very much alike exist in a lot of data sets. I mean, the exception could be something like that seminal paper that came out of Google over a decade ago now on dogs-versus-cats identification.
Camberley Bates: Yeah, I saw that.
Rita Wouhaybi: Yeah, I think the jury's still out; one year there are more dogs, another year there are more cats, but overall it's fairly balanced. Real life, though, is not fairly balanced, thankfully. Thankfully there are way fewer defects than good products. Thankfully there are way more healthy MRI images than unhealthy ones. But that means the challenge is in the sea of data that looks very much alike. What "looks alike" means in terms of similarity, and how you figure that out, differs based on the context. That's a very exciting area in AI: saying, "I want to start with interesting data rather than start with all the data." And that brings it back to the fact that data is what AI runs on. It makes sense, now that we have achieved so much in AI and are able to continue to mature it, to also look at this stream, this spring of data, or tsunami of data if you will, and understand whether there are interesting things we can do and what that means for compute in general.
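[As an illustration of "starting with interesting data rather than all the data," here is a minimal sketch of a greedy similarity filter that keeps a sample only if it is not a near-duplicate of anything already kept. Using cosine similarity over embeddings, and the 0.95 threshold, are assumptions for this sketch; as Rita notes, what counts as "alike" is domain-specific in practice.]

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_interesting(embeddings: list, threshold: float = 0.95) -> list:
    """Greedy near-duplicate filter: keep a sample only if it is not too similar
    to anything already kept, and return the indices of the kept samples."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)  # sufficiently different from everything kept so far
    return kept

if __name__ == "__main__":
    # Three near-identical vectors and one genuinely different one
    data = [np.array([1.0, 0.0]), np.array([0.99, 0.01]),
            np.array([1.0, 0.02]), np.array([0.0, 1.0])]
    print(select_interesting(data))  # [0, 3]: the look-alike samples are dropped
```

[The idea is the same one Rita raises for the data pipeline: decide upstream which samples carry new information so that storage, transfer, and training compute are spent on data that matters.]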
I mean, I looked it up the other day because I was curious. You mentioned that I have a PhD in EE, in electrical engineering, and I remember back in the day when I took computer architecture, the von Neumann architecture was the gold standard. But if you think about it, that's changing. We are seeing more and more of that shift happen. Networking did it already: more intelligence moved into SmartNICs and now DPUs and IPUs, and I think we're going to see the same happen with the data, with the storage, and with what it means to have intelligence there.
Camberley Bates: Okay, so let's cut to that, because we've talked about a broad range of topics here regarding AI. And with Solidigm specifically, people sometimes think flash drives and say, okay, what does that have to do with what Rita and I have been talking about? But it has a lot to do with it, because, as you've laid out very well, that pipeline is about building technology that deals with storing the data, training on it, RAG, inference, et cetera. So what are the areas you can talk about right now that you're working on, or the vision that you're working on at Solidigm, that people would be interested in hearing about?
Rita Wouhaybi: Yeah, first of all, I think the biggest one is that I don't see Solidigm as a flash and storage company; I see Solidigm as a player in the data pipeline, honestly. And I think for all of us to succeed, we have to bring intelligence to the entire pipeline. It can't be that I'll just collect data blindly and then have to deal with it later. That is going to go away, I think, as time goes on and as enterprises start bringing some of these technologies in-house. Enterprise AI, on-prem data centers, and applications for big companies are on the rise. You mentioned it: LLMs and RAG and vector DBs, because they want the goodness, but they don't want to expose their data. So it makes a lot of sense for us to start bringing some of that intelligence as part of our offering from Solidigm.
So I'm very excited to join Solidigm and start thinking about what it means to bring intelligence to the data itself, and what it means to figure out how to be more sustainable. We at Solidigm have already been doing a great job, in my opinion, on both capacity and power consumption on the drive, and those two are top of mind for enterprises, for individuals, for the cloud operators, for the hyperscalers. That is top of mind for us as well. But in addition to that, what does it mean to be part of that smart pipeline? How else can we help our customers protect their data, but also distill which of their data should be used for what? Because again, people are being crushed.
Camberley, it is so easy these days to add sensors to your environment, whether it's a camera or a temperature sensor, or to enable a controller in manufacturing or an MRI machine to produce a lot of data. These sensors are producing lots and lots of data, and the data, honestly, is crushing people. They don't know what to do with it. They're too scared to delete it, and deleting it does not make sense anymore, especially in this age when we know it's valuable, but they don't know what to do with it. And it's very difficult to collect… Not to collect, sorry, I take that back. It's very difficult to maintain, to store, to move around; it's costing them a lot of money with, hopefully, a potential payoff. So I think we need to close that gap. We need to figure out what is going to deliver that potential and help them identify it.
Camberley Bates: Yeah, and I know that, just as one small example, you guys are shipping 61-terabyte drives right now, and as we look at the savings, if I can take that consolidation down... One of the clients we had, I know, just went through a cycle of taking out all of their hard drive systems and bringing in all solid-state systems because they didn't have enough power.
Rita Wouhaybi: Yep.
Camberley Bates: Out of cycle. That's a really costly kind of thing, but that's what they needed to do. And we're seeing that in other places as well. Now, that was Europe, mind you, not so much the United States, but definitely over there, that issue of power and what we're doing with it, and that was even before the AI thing hit. So that gets to be pretty scary stuff. Well, we've been talking for almost 20 minutes now. Rita, thank you so very much for joining Infrastructure Matters. This is the Insider Edition. Any last comments for our listeners?
Rita Wouhaybi: I would love to hear from your listeners, whether they agree or disagree; I think it would be a great conversation. I'm always open for an intellectual debate. So thank you so much for having me. This was a lot of fun.
Camberley Bates: Thank you very much.