How Stable Diffusion was built: Tips and tricks to train large AI models
taibaili2023 · 2024-08-25 12:01:02 · 99 reads
How's everyone doing? Yay. Thank you for making this your last session of the day. I'm sure you're waiting for your beers right now. Now, we do have a little bit of an issue: our CEO, who's going to be joining us today, is running late from a meeting he had with the SageMaker team. That'll make for an interesting ML keynote tomorrow, which is what he's coming from. So before we get started—and by the way, Emad will be joining us soon, so don't worry—I want to get a feel for who's in the room.
Um so first question, who is familiar with models like GPT-3? Just raise your hand. Ok. Pretty much everyone. Ok. Cool.
Who knew about generative AI, like, seven months ago? Seven months ago. OK—more than I did. It's new for me. I'm just kidding.
Um next question, how many of you see yourself potentially training a foundational model in the next x years? So raise your hand, foundational model. Ok? Cool. Those are the heavy hitters. They're gonna be using a lot of GPUs.
So, last question: do you see yourself fine-tuning? Fine-tuning, OK. So just say it out loud—are you gonna be fine-tuning an open source model, for example, something mostly open source? Anyone fine-tuning something that's closed source out there? OK, cool.
All right. It sounds like we got the right audience. I think everyone knows the space pretty well. So I'll introduce myself. My name is Farshad. I'm part of the business development team that works with customers doing machine learning. Um I basically help customers build some of our largest distributed training environments and also inference infrastructure. I'm here with Pierre.
Yeah. My name is Pierre-Yves Aquilanti. I'm leading a solutions architect team called Frameworks ML. We typically take care of self-managed ML workloads—on EKS, Batch, ParallelCluster—anything that is large, ugly, hairy, and hard to solve.
Thanks Pierre. So the talk today is called How Stable Diffusion was Built and Tips and Tricks on How to Train Large AI Models. Now, the good thing is that some of you are interested in building foundational models and some of you will be doing fine tuning. A lot of the stuff that we'll talk about today will apply to both. We're probably not gonna spend as much time on the inference side, but we'll certainly spend a lot of time in training.
And then we'll also go into—well, once I show you the agenda, that'll make things a little bit easier.
So here's our agenda. We're gonna start with a pop quiz about generative AI just to get the blood flowing. These questions have prizes, so I'll give out some water bottles for that. Then we're gonna go through the recent history of AI and some of the trends it's causing. When Emad, the CEO, gets here, he's gonna cover the cluster that was used to train Stable Diffusion, and he's also gonna show you a preview of a lot of the new models they're working on, so that'll be really interesting. Then we'll get to the meat of the operation, which is how you actually build this stuff on AWS, which is what Pierre will be covering.
All right. So, first pop quiz, and whoever gets this right gets a water bottle from me: what is the largest cluster size that has been publicly announced for training? 4,000 A100s—that's correct. Can I throw this to you? Is that cool? Yeah, that's right, UltraCluster. We'll touch on that. Good job.
So ironically, it's actually from Stability AI, and that number is now past 4,000. Now, what's cool is that if Stability wants to, they can use those 4,000 GPUs for one training job, which is not out of the question—we do see customers doing things at that scale.
All right, next question. Keep in mind, 2025 is like two years away from now, right? So you're gonna see things like new images, videos, probably metaverse stuff, code that writes code. According to Gartner, what percentage of all data will be produced by generative AI by 2025? Raise your hand over there. 10%—that's right. And I think you're right, I think it is probably too low.
And the reason why I think it's too low is because this space is growing really, really fast. If you get a sneak peek into what it's capable of doing, it's sometimes a little bit scary, to be honest with you. But it doesn't just apply to things like images and video and entertainment in general—it also applies to things like drug discovery.
So Gartner estimates what percentage of drug discovery will use generative AI by 2025? One more hand over there—would you say 25? It's actually 50%. So this new technology is gonna change a lot of different industries.
My catches and my throws are not that good. So, thank you there.
Now, what's been causing this space to grow so fast? In the last five years there's been so much changing in the AI world, and it really comes down to one paper with the most meme-sounding title I've ever heard of for a paper. That paper is called "Attention Is All You Need." Doesn't sound technical, but it captures people's attention, right?
And this paper introduced a new architecture called the Transformer architecture. Hence the transformer over there, probably Optimus Prime.
And it really added two contributions to the ML space. The first is that it allowed computers doing AI to use parallel computing really efficiently, which was huge. The second is that it introduced the concept of attention, which, in the case of NLP, allowed AI to understand the relationships between words.
So if you've heard of things like GPT-3, BERT, and now generative AI like Stable Diffusion, these are all results of the transformer architecture. And if you talk to a lot of the thought leaders in the space, what they'll tell you is that they don't really see the transformer architecture changing too much in the next five years. That's also why you see chip manufacturers like NVIDIA, for example, implement a transformer engine in their new chip coming out next year, the H100.
Now, if you're curious about how the workflow in this generative AI and general transformer space goes, it's as follows.
The first thing is that transformers require a lot of data. Sometimes models are of just one type of modality, as we call it—for example, text or images. You take an enormous amount of data and you use that to train your transformer model, also known as a foundational model.
Then what you do—and this is the analogy that Emad told me yesterday over lunch, which I really like—the foundational model is kind of like a high school degree. You get it as your base level of knowledge. But then you want to go and become an expert in a specific use case, like computer science or art or marketing. And that's really what fine-tuning is: tuning the model for a specific use case, which allows it to perform a lot better.
One thing that's interesting that I've observed: if you take a really, really strong foundational model and don't fine-tune it for a use case, an OK foundational model that has been fine-tuned for that specific use case can outperform it.
So what that means is that a lot of the open source foundational models that are going to be released will have a lot of applications and still be very useful for the world.
Another thing to keep in mind: talking to the VC community, what they'll tell you is that they expect there to be roughly maybe 20 companies that build these foundational models—not too many—but you're probably gonna have thousands of companies taking advantage of fine-tuning. And finally you've got inference, and there's gonna be a lot of inference. So you've got a few of these, a lot of those, and an enormous amount of the last one.
Now, this is the AWS ML stack. What I really like about the stack is that we really do try to make it easy for you to use any part of it. If you don't want to do much ML, you can use the top of the stack, where, for example, you're just using API calls to interface with AWS. For example, if you want to do computer vision, you can send an image via API to Rekognition and it'll tell you what's in the image. Some of these services actually offer fine-tuning as well: you can send your data to AWS and train them for your use case.
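For readers who want to see what that top-of-the-stack pattern looks like in practice, here is a minimal sketch of the Rekognition call the speaker alludes to. The file name and region are placeholders, and it assumes AWS credentials are already configured.

```python
import boto3

# Minimal "top of the stack" example: send an image to Amazon Rekognition
# and print the labels it detects. "cat.jpg" and the region are placeholders.
rekognition = boto3.client("rekognition", region_name="us-east-1")

with open("cat.jpg", "rb") as f:
    image_bytes = f.read()

response = rekognition.detect_labels(
    Image={"Bytes": image_bytes},
    MaxLabels=10,
    MinConfidence=80.0,
)

for label in response["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")
```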
Now, the middle of the stack is SageMaker. I see a lot of customers use SageMaker. And for some reason, so far, a lot of the customers I've worked with in the large language model or generative AI space tend to actually operate at the bottom of the stack. That's been the case for Stability this last year.
So what does the bottom of the stack really mean? It means that you're using, for example, EC2, and storage services like S3. And what you tend to find is that the customers doing what we call self-managed ML tend to have pretty consistent architectures that they use.
So for orchestration, some customers may use EKS, some may use ECS. Some customers use, for example, Trainium, which is gonna be really good for distributed training. And a lot of customers use A100s, which is our p4d instance type—that's the case for Stability.
These are the specific services that they use: A100s for training, S3 for storage, and FSx for Lustre, which is great for distributed training. For orchestration they actually use ParallelCluster—so if you've used Slurm in the past, that's a great service to use. And finally, EFA becomes really important in the space of distributed training as well.
Now, a fun fact about Stability: they're actually in the process of testing Trainium (Trn1) and Inferentia 2, which I think was just announced like an hour ago. Here are some more fun facts about Stability.
First of all, a lot of times people think that Stability is training one model at a time, which sometimes is true. Sometimes Stability will take their cluster of 5,000 GPUs—well, 4,000, but now it's 5,000—and train one very, very large model. But what they also do is train 10 models at any given point in time.
Most of the time, the benefit of doing this is that it can really reduce the spikiness of your workloads and give you a flat usage throughout the year. The benefit of having flat usage is that you can use things like Savings Plans to really bring down the cost of the GPUs and instances that you're using.
Has anyone seen the launch of Stable Diffusion 2, by the way? Raise your hand. I'm curious—what are y'all's thoughts on Stable Diffusion 2? Good so far? Anyone think it's bad so far? You can be honest. OK, some thumbs down. Cool. Yeah, I've been monitoring it as well. I just saw today that Stable Diffusion 2 does allow AI to actually make hands, so that's pretty cool.
All right, and we've got our CEO and his account management team walking in now. Perfect timing.
Thanks for joining us, Emad. You're up very soon. Now, Stability just launched Stable Diffusion 2—was it Friday of last week? I believe Thursday?
Stable Diffusion 2.0 took 200,000 A100 hours to train, right? And I could be just guessing here, but I imagine your future models are gonna take more and more over time. 1.2 million—there you go, 1.2 million hours for the OpenCLIP model.
Someone hit on UltraClusters earlier—I think it was you. Another thing to keep in mind is that when you're building these very, very large clusters, you really want to take advantage of UltraClusters and EFA to optimize distributed training. UltraClusters basically allow you, at a high level, to make your giant cluster look like one supercomputer.
So instead of me talking about Stability, I'm gonna hand the mic to Emad. Emad, maybe you could do a quick intro, and then the slides are yours.
Cheers buddy. Thanks man.
Hello everyone. Thank you for staying here and not playing blackjack or getting drunk right now—you can do that after this.
Hi, I'm Emad Mostaque, CEO and founder of Stability AI.
In a previous life I was a hedge fund manager. Then I led the United Nations' AI initiative against COVID-19, did AI work to repurpose drugs for autism, and decided, hey, why not make open source AI for everyone, so we can always have an augmented future.
The Stability platform is basically based around open source, then scale and service. So we built communities such as EleutherAI, LAION, and others that we're supporting, and we'll spin them out into independent foundations on a vertical basis—for language models, code models, image models, protein folding, and other things.
We offer APIs and integrations—providing it at a ridiculous scale, and there's gonna be some interesting stuff about that soon. And then we go into the largest companies in the world and build gigantic models.
So that's a TBD, kind of coming, thing. You need to have supercomputers, talent, and data in order to build these models. From EleutherAI we had The Pile, which was one of the most commonly used text language datasets; at LAION, we built LAION-5B. Previously, the largest image dataset was about 100 million text-image pairs; LAION-5B is 5.6 billion, and the new version will be even bigger.
The Pile version 2 will probably be about two terabytes of data as well. On the talent side, we've got our core team, our community, and academic partners. And then finally, for scale, we've got Amazon—AWS—you can't get more scale than that.
We decided to go very big. We started a year ago with two V100s, I think. And as of a couple of months ago, we had 4,000 A100s in one UltraCluster, all on the same spine, optimized to a ridiculous degree by the AWS team.
To put that in context, this is the largest public A100 cluster—I think it's about the 11th fastest public cluster in the world, full stop. Right now we actually have about 6,000, so we're getting up there, about Perlmutter size. One day we'll catch up with Meta—one day, next year. So, yeah, it's been quite something really understanding and learning how to use clusters of this size.
This is from the State of AI Report, which is a fantastic one as well, and I think this is part of the exponential nature of this technology.
So the fastest supercomputer in the UK is Cambridge-1 at 640 A100s. The fastest one in Canada is probably Narval at 636 A100s. In France, about 440. So this kind of shows the scale and the exponential nature of what's required to build some of these models, because even though we trained Stable Diffusion on 256 A100s, we've done training runs of up to 1,500 for some of our other models, particularly the next-generation AI architectures around image and some of these other things. I think a lot of you are here to hear about Stable Diffusion.
Latent diffusion plus plus—Stable Diffusion was a collaboration with the CompVis team at Heidelberg and LMU Munich, Runway ML, LAION, EleutherAI, and our own Stability team, driven and run by Robin Rombach, who's one of the leads on generative AI at Stability, along with Katherine Crowson, a.k.a. RiversHaveWings.
So over the last 18 months, we've been funding the entire open source AI art space, moving from generative models towards these diffusion-type models. Originally it was a generative model that was then guided by, for example, CLIP. VQGAN plus CLIP was one of the original ones: you had a generative model and then a guidance model—text-to-image and image-to-text—bouncing back and forth off each other.
As the last year progressed, we moved into these diffusion-based models instead, where it works almost like a denoising function. You start with some noise—this is what seeds are, or image-to-image—and then you denoise it towards that stationary distribution to get to the original target prompt. In reality it just looks a bit like magic, but that's the high-level picture. And this has been the real driver, because we got to Stable Diffusion 2 just a short while ago.
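For those who want that "denoising function" intuition in symbols, this is the standard DDPM-style formulation that latent diffusion models such as Stable Diffusion build on; the notation follows the diffusion literature rather than anything shown in the talk.

```latex
% Forward (noising) process: add a little Gaussian noise at each of T steps
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)

% Closed form: jump directly from the clean sample x_0 to step t
x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0,\mathbf{I}),\quad \bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)

% Training: a network learns to predict the noise, conditioned on the prompt c;
% sampling runs this in reverse, which is the "denoising towards the prompt" step
\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\,\lVert \epsilon - \epsilon_\theta(x_t, t, c)\rVert_2^2\,\right]
```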
So Stable Diffusion 1 took 100,000 gigabytes of images—that was LAION, about 2 billion images—and created a two-gigabyte file that could do just about anything. For Stable Diffusion 2 we adjusted things, because with Stable Diffusion 1 we had an image generation model and then a text model: that was OpenAI's CLIP model, which they released in January of last year and got this all going. But we didn't know what the dataset behind CLIP was.
So when we had a lot of questions around things like artist attribution, around not-safe-for-work content and other things, we had no idea what data was in there. That's why we trained this OpenCLIP model with LAION, which was, as I said, about 1.2 million A100 hours. So we knew what happened on both the image generation side and the text side.
These are some examples of generations you can get with Stable Diffusion 2. And we've reduced the inference time from about 5.6 seconds at launch to 0.8 seconds at the moment. And tomorrow we have a very big announcement—I don't think I'm allowed to say it here—about inference times, which I think will change the game again.
These are all raw, unedited outputs as well, which is a bit insane. Who would have thought we'd get to photorealism so quickly? One of the main things with Stable Diffusion 2 is that we wanted to have a flat architecture, so we did de-duplication and a whole bunch of other things so that certain things wouldn't be overfit. This allows us to use it as a base for fine-tuning.
So someone took this—one of the open source researchers from Hugging Face, I'll credit them properly in a minute—and created a model fine-tuned on just 10 images. You can use the Diffusers library that they have to do this in like 10 lines of code; they fine-tuned it to Mad Max. So you can take your own face—some of you will have seen that—and use textual inversion or DreamBooth or some of these other technologies to really point it at various things.
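The talk doesn't show the actual notebook, but roughly the workflow is: fine-tune with the DreamBooth example script that ships with Hugging Face's diffusers library, then load the resulting checkpoint like any other pipeline. A minimal sketch of the second half is below; the output directory and the "sks" placeholder token are assumptions for illustration, not something published by Stability or Hugging Face.

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical output of the diffusers DreamBooth example script after
# fine-tuning Stable Diffusion 2 on ~10 images of a subject bound to the
# placeholder token "sks".
model_dir = "./dreambooth-sd2-output"  # placeholder path

pipe = StableDiffusionPipeline.from_pretrained(model_dir, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

image = pipe(
    "a photo of sks person in the style of Mad Max, desert wasteland, cinematic",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("mad_max_me.png")
```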
In fact, what some people are doing now—because the entire distribution was flattened, so artists like Greg Rutkowski and certain celebrities were pushed down in prominence—is building embeddings to bring those things back up in the latent-space distribution. Because that's what it looks like when you take 100,000 gigabytes of data and compress it down into two gigabytes of knowledge: I suppose it's these spikes in latent space whereby you understand various things and principles about the nature of things.
Well, Emma Watson—which was like 5% of all images generated in our Discord bot when we were doing the beta. It doesn't do Emma Watson anymore; I don't wonder why that is. So tomorrow we're actually relaunching our Discord bot as well, so we'll see what people do. I think that's pretty cool—like, look at that. Is that Las Vegas? Let's see.
We also released a variety of other models, because I don't think the image model alone is enough. So we released a depth-to-image model, based on MiDaS, that allows more accurate image-to-image. It enables a transformation by taking a depth map that can then map onto 3D context, because we're moving from 2D to true 3D.
In fact, in January we'll be releasing a glasses-free 3D tablet. We have one of the prototypes here—so please, don't anyone steal it. Maybe some people can see it later. I think the future is seamless 2D/3D, all of this type of generation, and this is one of the things these foundation models have enabled.
So GPT-Neo, GPT-J, and NeoX were downloaded 25 million times by developers, which I still can't get my head around. Text is one of the hardest ways of communicating after voice—voice is the easiest when we have a conversation. Communicating through images, be it this or PowerPoint, is incredibly difficult and frustrating, especially PowerPoint. But now we've basically created technology whereby, in a second, anyone can create anything, which is kind of insane, because it means we can all communicate visually—and soon in real time, 24 frames a second. Announcement soon.
We also have inpainting, because sometimes you want to adjust things. So we have encoded mask images that then adjust, and we're working on technology to enable prompt-to-prompt and other things, so you can just describe the changes that you want and it'll then go and remove people, add people, do all that kind of thing.
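As a concrete reference point, the released inpainting weights can be driven from the diffusers library roughly like this; the input image and mask file names are placeholders (white pixels in the mask mark the region to repaint).

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Mask-based inpainting with the published Stable Diffusion 2 inpainting weights.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("room.png").convert("RGB").resize((512, 512))   # placeholder
mask_image = Image.open("mask.png").convert("RGB").resize((512, 512))   # placeholder

result = pipe(
    prompt="an empty room, photorealistic, nobody present",
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=50,
).images[0]
result.save("room_edited.png")
```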
I think this gets extended to things like upgrading your child's artwork through text-to-image, because again, this is a denoising function whereby you start with an initial thing. It can be a seed of random distributions, which gives you the constants, or it can be an image that then becomes—it's a bit creepy. Why do you pick these, everyone? Nightmares?
But then it's not enough to just have these models—you need to integrate them into workflows. So we've got plugins for Photoshop, GIMP, and others. Some of them we officially support, like the Photoshop plugin; other ones, like the Krita plugin, etc., we don't.
There is an issue in that we had to release Stable Diffusion under a different type of license, this CreativeML license, which basically says: don't use it for unethical stuff, don't be naughty—because you can use it for anything. So we had to be a bit careful. Because we don't have an MIT or CC BY 4.0 license like with our other models, it hasn't been able to be integrated fully into Blender and other things.
But in the new year, we are going to be moving to fully open source now that we have mitigations, and I think you'll see this explode even further. Similarly, Stable Diffusion 2 is largely safe for work—still not fully safe for work. One of the things is that as we got to photoreal, you can't really have not-safe-for-work content and kids in the same model, because you can combine the concepts.
But this will mean it will be allowed to be used by more and more people, because it's pretty safe—we built in the safety filter and did a whole bunch of things. We did go a bit overboard, though. If you look at the LAION dataset, you've got a "p-unsafe" score, which is the probability of porn.
Porn really kicks in at 0.99 and above, or 0.98. We filtered at 0.1, which I'd like to say was deliberate—it was actually a little bit of an accident. But it also meant that we removed all the humans from the training dataset, and we're adding them back in right now. So, Stable Diffusion 1 really soon; Stable Diffusion 2, to be seen. Yeah.
It's actually also interesting because with these models, what is the optimal amount of data? A year ago, Katherine Crowson, our lead coder—jeez, it's been that long—released cc12m, which was one of the first conditioned models. It had the generator model and then the text model embedded; it led to DALL-E 2 and other things.
That was used as the first basis of Midjourney, where we funded the beta and a bunch of other things, and it could create really good graphics. Obviously things have now moved on dramatically, but that only used 12 million images—and it could still create these things. How many images do you need to create these models? We only need 10 to fine-tune them now, because it's almost like teaching a high schooler, or a very precocious kindergartner—I think it's gone to grade school now.
We're not sure. And so I think understanding how the models interact with data is going to be really interesting, particularly when you talk about data augmentation and other things. We also try to make it easy. DreamStudio is a bit crap right now—it's our implementation—but DreamStudio Pro is coming out shortly, with node-based editing, 3D, keyframing, dynamic kind of stuff. We're really trying to push and experiment with how these things are interacted with.
So again, we've got kind of three classes of displays, we've got new mechanisms of human-computer interaction, and we really want to test this out. This is also part of why—for those of you familiar with the ecosystem—there are a lot of Google Colab notebooks, soon to be SageMaker notebooks as well, which we've had communities gathering around.
So all the lead developers are members of our team—we mostly hire from the community—and it was just great seeing the experimentation around animation and a bunch of these different things that this enabled. And again, one of the things we're really trying to do is combine that AI, open source, and community to create a very differentiated company, and to see an ecosystem of companies and foundations across these modalities. The relationship with open source is this: when I was doing the United Nations COVID thing, a lot of companies promised a lot and didn't come through, and I thought, this is crap. And I thought about my kids, and I'm like: in the future, AI is going to drive everything from education to healthcare. Does it really make sense for that to be run by one company? No. This is a super powerful technology—like I said, we can now communicate visually as a species with no barrier, just literally in the last year. What's that going to lead to? I don't know. Well, that should be a commons for everyone, and it shouldn't be a case of "this is too powerful, people can't have the technology." When people say it's too dangerous, I often ask a very simple thing: so you don't want Indians to have this, or Africans? There's no real answer to that, because it's basically a racist statement, right? Because the reality is, however much you try to confine this technology, it's available now.
So like I said, with a two-gigabyte file you can run Stable Diffusion on your iPhone, and with the new version of Replay you can run it dynamically anywhere. Soon you'll be able to run it incredibly fast as well on a variety of these things. And this is where people are just coding on their MacBook M1s, but we'll make it more and more accessible to everyone. I think this is a big differentiator as well, because a lot of people think about bigger and bigger models, and bigger and bigger models are fine. But now you're seeing the combination of DL and RL to create more customized models—these are the Instruct models, a lot of these embedding-based approaches and others. And we thought that smaller, more dynamic models would work a lot better.
So we have a 65-billion-parameter, chinchilla-optimal model training at the moment. But we also released, for example, our Instruct model series, so you can take these reasonably large models, which can still fit on one pod, instruct-tune them to be really optimized for your use case, and then shrink them down. We're also about to release things around distillation and other model optimization techniques, because once they're accessible, they go out into the community and the community develops wonderful things.
So hundreds of people have developed on Stable Diffusion. In fact, as of today, Stable Diffusion overtook Ethereum and Bitcoin in GitHub stars, in three months, which is pretty cool. It's much better than web3, let me tell you. Distributed AI—that's the new buzzword.
And we've seen hundreds of thousands of developers actively using it. Hundreds of companies have emerged from it, and it's not even mature—like, nobody's gonna be using Stable Diffusion 2 in three months. I know, because we've got much better models that we're going to release. And this is the pace of development. To be honest, I didn't think we'd get to photoreal. I mean, that's crazy, right? It's like, wow—a bit creepy, also crazy.
And so, yeah, this is the thing: when you put it out open source, people use it as part of their building blocks. And as I said, our business model is very simple: we're going to go and take exabytes of the world's content data and turn it into foundation models. And then I'm gonna remake Game of Thrones season 8, because it was shit, you know.
We want to be the foundation as well, whereby we've got new data partnerships. So we just got an exclusive license to Bollywood data, so we're gonna have an A. R. Rahman model, and then movies that are exactly the same—no wait, very different—coming out of that. And we're building national datasets for every country, as well as India GPT, Thailand GPT, and all these things, because a lot of this is kind of a monoculture where you don't have the data available.
But I thought Japanese Stable Diffusion, for example, was fantastic, because they took Stable Diffusion and retrained the text encoder for Japanese—so "salaryman" didn't mean someone very happy with lots of money, but a very sad person, you know. That regional context is only available if you make these models available. And like I said, I think we want to be the kind of platform that provides the generalized models and then makes them accessible—the scale and service.
So if anyone's going to build an API, you probably shouldn't, because we'll probably outcompete you, given our plans here. Please use ours, or just do it yourself—but it's all good. We're gonna make it accessible to everyone, partnering with AWS.
We've been co-building with the SageMaker team. To give you an example, the SageMaker team worked with us to get GPT-NeoX—our language models—on the UltraClusters from 103 teraflops per GPU to 163 teraflops per GPU: 52% utilization across 512 A100s. That was pretty amazing, given EFA and all of that. And when I look forward to the P5s and some of the other things, it's gonna get very, very interesting on training, because like I said, we train on up to 1,000 A100s, or 1,500 for some of our things. And then there was the access to compute as well.
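The utilization figure quoted above is consistent with the A100's commonly quoted dense BF16/FP16 peak of 312 TFLOP/s; a quick back-of-the-envelope check:

```latex
\text{utilization} \;\approx\; \frac{163\ \text{TFLOP/s achieved per GPU}}{312\ \text{TFLOP/s peak (A100, dense BF16)}} \;\approx\; 0.52
```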
So, for a 13-month-old company to have 4,000 A100s was a pretty amazing thing. I think that's testament to the scale that AWS has, and also obviously the foresight to back us—thank you, I appreciate it. And now, as the GPU overlords, we have control over more GPUs than almost anyone. I try to tell people that's the new status symbol—never mind yachts and everything else. How many GPUs do you have? That's the question.
And then, in partnership with Amazon, this is a broader thing, because obviously as AI comes, Amazon itself is an AI consumer. So if you look at Studios and a whole bunch of these other things, I think a lot of this intelligence will be pushed out to the edge, so you have multimodality with these very small models that are optimized. I think we can get Stable Diffusion down to 200 megabytes, for example, which again is insane.
So you can have edge compute, where I have this vision of an intelligent internet in which every single person has their own model, every single country and culture—and these will be different sizes, and they will interact with each other because they are so dynamic, and you can have cross-textual embeddings and things like that as well. I think that's what you'll need to communicate and create brand new experiences of all different types. And that's gonna be pretty crazy, given that there's probably gonna be like $100 billion put into this sector in the next few years. Let's face it: self-driving cars got $100 billion, crypto got like $300 billion, and they're kind of crap compared to this. I mean, how cool is it to create anything you can imagine? Come on.
So, yeah—I think I pressed the wrong button. How can you use Stability models today? You can download them, you can use the API—well, stay tuned. Attend the ML keynote tomorrow, so you can learn more about some other things happening that we can't announce today.
Oh yeah, there are some very exciting things they'll announce tomorrow, shall we say. I think literally a step change—that's a good way to put it, isn't it? It's gonna be fun. So, using Stability today: we've got Stable Diffusion 2.0, depth-to-image, we actually released the four-times upscaler and we're about to release the eight-times upscaler, Stable Diffusion 2.1, and a whole variety of other architectures will be there. We're gonna release something very nice for Christmas—inpainting and outpainting, et cetera—and we've got an absolutely packed roadmap. We've got the next-generation language model and text data stack—join EleutherAI and play with that. Our code generation model, CodeGen—join CarperAI, our representation learning lab, for that. We just released OpenELM, which is an evolutionary code generation algorithm, so we're doing a lot of focus on open-endedness. Also from that lab, we released our instruct dataset, so you can take these large language models and use reinforcement learning to really customize them to your needs and create instruct models. We've released Dance Diffusion, which is our audio model, and we're gonna release a text-conditioned audio model, so you can describe anything and create any type of music—an A. R. Rahman model will be the first one, for the Indians here. That'll be fun. We're training our video models—those will be really interesting as well, though we might not need them soon—and 3D models, and then we're gonna be announcing fine-tuning via the API and lots of API things there. So it's all been very exciting. I'm very tired, but I hope you are enjoying this, and that you take it and build amazing stuff. Like I said, we are fully multimodal, we do a whole variety of things. Amazon has been an amazing partner, and now we really hope to make it available to everyone, as well as share our knowledge. We've been writing up usage guides on all of this, and we've been contributing to ParallelCluster and a lot of these other systems as well, because we're taking the pain of getting all the edges out so you guys won't have to. So, thank you very much, everyone. Is that OK? Thank you.
I'm gonna do a quick show of hands first. Who's more on the application side than, let's say, operations or infrastructure? Applications? No, not so much actually. And more on the infrastructure side—who's managing infrastructure, who's running jobs? OK. So let me introduce you to the architecture that Stability is using to build their models. There are actually quite a few components that they are using: compute, network, and obviously storage. These are the key components—they are the same components for every workload, you could say—but we're picking a few specific services, including on the orchestration layer with ParallelCluster and CloudFormation. We are using p4d and Trainium, containers, as well as EFA—which also means placing the instances where we want them to be placed in an Availability Zone—and then the storage components. So I'll go into a bit more detail on each of them.
The first one is UltraClusters—what we refer to as Amazon EC2 UltraClusters—and the p4d instances. You can place up to 4,000 GPUs per UltraCluster, and those will be aggregated in a tightly coupled fashion, meaning that you get full bisection bandwidth through EFA as well as low latency. That's typically the configuration you want when you have a tightly coupled workload—for example, when you have to scale your model across multiple instances. So if you need to go beyond one instance, that's really what you want to use. One other interesting property: within UltraClusters we also just made Trainium generally available. The difference compared to p4d is that the memory on one instance can be accessed by any chip—one accelerator can address the full memory space across the accelerators. The advantage is that you have 512 gigabytes that each accelerator can use, because it's effectively a global memory, and you've also got network connectivity of 800 gigabits per second, which matters a lot for transformer-based workloads; we've seen some of our customers really looking for that higher throughput between instances. What you also have to consider with Trainium—which Stability is also testing, if I'm not mistaken—is that you can aggregate many of those chips, up to 30,000, into one large cluster. That gives you the possibility to run not only large models but also many of them at the same time. All of this is enabled through a network technology that we call EFA, the Elastic Fabric Adapter. If you're not familiar with it, we launched it in 2018 or 2019—it was one of my first years. What it allows you to do is shortcut the path between your application and the hardware. You go through libfabric, an open source library, and you tap in directly using NCCL, for example—or it can also be MPI—which directly calls libfabric, which accesses the device. That's what enables you to scale to thousands of GPUs without being impacted by the high latencies and variability that you can see, for example, over TCP with ENA networking. The other advantage of EFA is that it uses a protocol called Scalable Reliable Datagram, which sprays packets out of order across multiple paths within the network, meaning you are not subject to head-of-line blocking: if the first packet did not arrive, you don't have to wait for it to be retransmitted before the rest can be delivered.
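From the application's point of view, EFA stays mostly invisible: NCCL picks it up through the aws-ofi-nccl plugin and libfabric, so the training code just initializes the NCCL backend as usual. Below is a minimal sketch of that initialization as it might appear in a PyTorch job; the environment variables shown are ones commonly set on EFA-enabled clusters and should be treated as illustrative, not an official or exhaustive list.

```python
import os
import torch
import torch.distributed as dist

# Commonly used settings on EFA-enabled clusters (illustrative, not exhaustive).
os.environ.setdefault("FI_PROVIDER", "efa")           # route libfabric over EFA
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")  # GPUDirect RDMA on p4d
os.environ.setdefault("NCCL_DEBUG", "INFO")           # look for the EFA provider in the logs

def init_distributed() -> None:
    # Under Slurm these are usually derived from SLURM_PROCID / SLURM_NTASKS,
    # or set automatically by a launcher such as torchrun.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

if __name__ == "__main__":
    init_distributed()
    # Quick sanity check: a tiny all-reduce exercises the inter-node path end to end.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: all-reduce result = {x.item()}")
    dist.destroy_process_group()
```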
There are a few other components that are important—let me go back to the previous one. It's storage, and it's one that is particularly overlooked; I see a lot of challenges from customers around what the right hierarchy is. The way to think of it is that Amazon S3 is your backbone. You have three components: your NVMe drives, FSx for Lustre, and Amazon S3. S3 is the backbone—that's where you store your data, your results, maybe checkpoints as well—and you move data, for example, through the CLI or an SDK, but also through the Lustre integration, which lets you exchange data between S3 and your FSx for Lustre file system. FSx for Lustre is your high-speed storage, with sub-millisecond latency—much lower than S3—and it's POSIX-compatible, meaning you don't need to transform your application or call an SDK to access it. The good thing is that you can have, say, one petabyte of data in your S3 bucket and an FSx for Lustre file system of, say, one terabyte, and that's fine: you use it as hot storage and only stage in the data you're working on. That allows you to create multiple file systems and multiple clusters while always having the same source. Then the other important piece that Stability uses is the instance store. If you're not familiar with it, these are the NVMe drives physically attached to the servers, arranged in a RAID 0, so they appear as one big disk composed of multiple smaller NVMe drives of one terabyte each. You can use that for checkpointing or I/O-heavy transactions during compute—for example, to write out the weights. Stability actually uses all three: the instance store on the compute side, input/output and checkpoints on FSx for Lustre, and Amazon S3 as the storage backbone. All of that can be quite complex if you were to create your own infrastructure, whether through CloudFormation or the CLI—I've seen some people do that—and in fact Stability, like other customers in this space, is using ParallelCluster.
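A sketch of that three-tier pattern in a training loop might look like the following: checkpoints are written to the local NVMe instance store first (fast but ephemeral), then copied to S3 as the durable backbone, while FSx for Lustre would typically serve the training data itself. The mount point and bucket name are placeholders.

```python
import os
import boto3
import torch

LOCAL_SCRATCH = "/scratch"             # placeholder: RAID-0 of the instance-store NVMe drives
S3_BUCKET = "my-training-checkpoints"  # placeholder bucket name
s3 = boto3.client("s3")

def save_checkpoint(model: torch.nn.Module, step: int) -> None:
    # Fast local write to the instance store, then durable copy to S3.
    local_path = os.path.join(LOCAL_SCRATCH, f"ckpt_{step:07d}.pt")
    torch.save(model.state_dict(), local_path)
    s3.upload_file(local_path, S3_BUCKET, f"run-001/ckpt_{step:07d}.pt")
    # The local copy can be dropped once uploaded; the instance store does not
    # survive instance replacement.
    os.remove(local_path)
```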
So, quick question: who's familiar with ParallelCluster? Two, three people? Seriously? OK, one more. OK, thanks, Tyler. ParallelCluster is an open source project—an open source tool supported by AWS. You can find it on GitHub and install it via pip, for example, with Python. It allows you to create clusters through a configuration file. In the case of Stability, they created a head node, from which you connect and submit your jobs, and this cluster with their compute nodes, their shared file systems, and so on. Everything is defined in this config file. If we take a deeper dive at it: you define the head node—that's where you connect, access the node, and submit your computational jobs, in this case training. Then you define the compute resources, which will be p4d, or could be, for example, Trn1 instances—and you can even define multiple queues. Then you define the storage. You run one command using this YAML file as input and say "create"; it takes a few minutes, whereas if you were to build such a cluster yourself it would probably take a few months. That's literally all you deploy. And in fact, if you look at the Stability GitHub repo—I believe it's the stability-hpc repository—you can actually create a cluster identical to what Stability is using.
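For orientation, a heavily stripped-down ParallelCluster 3.x configuration along those lines might look like the YAML below, here wrapped in a small Python script that writes it out. Subnet, key pair, and capacity values are placeholders; Stability's real configuration, published in the repo Pierre mentions, is far more complete.

```python
# Write a minimal ParallelCluster config (Slurm scheduler, one p4d queue with
# EFA, FSx for Lustre shared storage), then create the cluster with the CLI.
CLUSTER_CONFIG = """
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.9xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0   # placeholder
  Ssh:
    KeyName: my-keypair                  # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: train
      ComputeResources:
        - Name: p4d
          InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: 64
          Efa:
            Enabled: true
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0     # placeholder
        PlacementGroup:
          Enabled: true
SharedStorage:
  - MountDir: /fsx
    Name: fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 4800
"""

with open("cluster-config.yaml", "w") as f:
    f.write(CLUSTER_CONFIG)

# Then, roughly:
#   pcluster create-cluster --cluster-name my-training-cluster \
#       --cluster-configuration cluster-config.yaml
```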
There are a few things that they have done on top of that. They work with a lot of different users coming from different institutions and companies, and authentication is important because you need to let users in, control who's accessing what, and also know and audit who's accessing which cluster and which resource. They use authentication through AWS Directory Service (Active Directory), connected directly to ParallelCluster. So if they create a new cluster, that's fine: it's connected to Active Directory, users can connect, and there's no extra access setup or tuning required. It also allows them to keep track of jobs across regions and across clusters—who is using what and when. Another important property of ParallelCluster, and of the cluster Stability has built, is that you can have multiple queues and multiple compute resources. For example, if you have a training workload using p4d, you can have those jobs running, and you can have a preprocessing workload running on another queue with another set of resources, or, for example, an inference workload executed on Inferentia or on G5 instances. So you can have those different queues within the same cluster, or even create multiple clusters for different teams. That's one of the properties the tool offers. The way Stability structured it is with some additional configuration—Richard Vencu, who is the lead HPC architect, if I'm not mistaken, created a lot of packages and scripts to configure the cluster. Some of the scripts are deployed cluster-wide, but you can also deploy scripts just for the head node or just for the compute nodes. On the head node, for example, you'd add configuration for Slurm, the batch scheduler—the job scheduler—and point to an EC2 capacity reservation, because they actually have a reservation for this kind of instance. And on the compute nodes you'd install, for example, GPU monitoring tools, which you don't need on the head node because it's a c5.9xlarge—a CPU node—but you do need on the GPU nodes to monitor them. And there are scripts that are specific, again, to the head node, to the compute nodes, and so on. All of that is executed at launch when you create the cluster, and you can also re-execute them during an update—yes, you can update a cluster live. Then the users install their own packages and tools on top of that once the cluster has been created. But there's one more property of this tool that is important. In the case of Stability—I was talking to Rich, and he told me—they have two kinds of queues. Well, more than that, but two kinds: high-priority queues, for example for training, and low-priority jobs, for example for preprocessing, such as decompressing a dataset. What happens is they run those low-priority jobs, which typically last less than 48 hours and can be interrupted.
Then, when a big training job is going to be executed, they preempt those low-priority jobs, and the jobs are requeued automatically—in fact, it's managed directly by Slurm, so it's something you can use yourself. Once the higher-priority training job has been executed, the low-priority jobs run again, and they either complete or get preempted again if a new, higher-priority training job comes in. They're at about 93% utilization on the cluster.
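As an aside, a rough sketch of that two-queue pattern from the submission side, assuming the administrator has already defined "high" and "low" QOS levels with preemption (e.g. PreemptMode=REQUEUE) in the Slurm configuration—that part is not shown here:

```python
import subprocess

def submit(script: str, qos: str, requeue: bool = False, nodes: int = 1) -> None:
    """Submit a Slurm batch script with a given QOS; optionally allow requeueing."""
    cmd = ["sbatch", f"--qos={qos}", f"--nodes={nodes}"]
    if requeue:
        cmd.append("--requeue")  # let Slurm restart the job after preemption
    cmd.append(script)
    subprocess.run(cmd, check=True)

# Low-priority, interruptible preprocessing (e.g. decompressing a dataset).
submit("preprocess_dataset.sbatch", qos="low", requeue=True)

# High-priority distributed training job that may preempt the one above.
submit("train_model.sbatch", qos="high", nodes=32)
```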
I come from the on-prem space, and there, utilization is about the same—but that's a very good utilization rate. 93% is really high; I think it's usually about 80 to 85%. Yeah, that's correct, 80 to 85% for the main supercomputers, and we're talking about national-level supercomputers. So it's quite an impressive utilization rate. And that's achieved by managing the resources aggressively, not only through job optimization and QOS—quality of service, which is built into Slurm—but also by monitoring the jobs, for example by monitoring the GPUs. They're using DCGM, by the way—you can look it up—to check, for example: is my GPU fine? When I get an instance, do I have the right topology? Do I have the right bandwidth? This runs before a job and it's automated, but it's also used for debugging, for example to collect errors when there's a hint that an instance may have an issue. If an error is detected, they discard it: a script from Rich automatically puts the instance in an exclude list and then automatically opens a ticket—through Premium Support, if I'm not mistaken—to the EC2 teams. The team also monitors the utilization of their resources in two ways: CloudWatch, as well as Prometheus and Grafana—the latter is the main one for GPU resource usage, as well as CPU, storage, storage throughput, and so on. So it's really about building not only a cluster that is accessible but one that can also be monitored and optimized. The utilization rate is quite impressive, and you can access all these files: take a look at these references—Stability published their cluster configuration as well as the different tools. It's quite complete, from what I can tell, and I highly recommend that you take a look at it.
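The health-check-and-drain flow described here could be sketched roughly as follows, using DCGM's dcgmi diagnostic and Slurm's scontrol; the diagnostic level and the drain reason string are arbitrary choices, and the real scripts Stability published do considerably more.

```python
import socket
import subprocess

def node_is_healthy() -> bool:
    # "dcgmi diag -r 2" runs DCGM's medium diagnostic suite; non-zero exit means failure.
    result = subprocess.run(["dcgmi", "diag", "-r", "2"], capture_output=True, text=True)
    return result.returncode == 0

if not node_is_healthy():
    # Drain the node in Slurm so the scheduler stops placing jobs on it;
    # the follow-up (support ticket, instance replacement) is site-specific.
    hostname = socket.gethostname()
    subprocess.run(
        ["scontrol", "update", f"NodeName={hostname}",
         "State=DRAIN", "Reason=dcgm_diag_failed"],
        check=True,
    )
```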
With that being said, I'll hand it back. Earlier we talked about how the Achilles' heel of AI has been not being able to do hands—and this is from Stable Diffusion 2. So if you take a look at some of the more recent Reddit threads on Stable Diffusion, you'll actually see there's a ton of new hands being generated, and it's quite amazing. This was generated by Stable Diffusion 2. Emad, I want to thank you for being up here; Pierre, thank you for doing this with us. We have about eight minutes left, so we want to take some time to answer any questions that you have.