Stable Diffusion: Generating Images From Words
For the past few months, I’ve been noticing a lot of news stories about DALL·E 2, an AI image generator that uses a diffusion model to create images from text prompts. It was private for quite a while, and there were some similar but less powerful projects that were open. I played around with those a bit, but I was mostly waiting for a general release. I ended up getting into the DALL·E 2 beta a few weeks ago, and last week I saw news that another project called Stable Diffusion had just been publicly released, so I installed it on my MacBook. The results really blew me away!
Getting it working
Getting the Stable Diffusion release working on my Mac wasn’t too bad; I just followed the steps on a guide I found. There were some gotchas, like needing to install Rust to compile one of the dependencies and a few files to change around, but for the most part it worked pretty quickly. I could have written down the steps, but I’m sure the process will get a lot simpler once there’s more specific support for Apple Silicon.
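If you’d rather not follow a third-party guide, one alternative route (not the exact steps I followed) is the Hugging Face diffusers library, which is also what the code sketches later in this post assume. Something like this should get the basics sorted:

```python
# One alternative setup route (not the exact guide I followed): the Hugging
# Face "diffusers" library, which has Apple Silicon (MPS) support.
#
#   pip install torch diffusers transformers accelerate
#
import torch

# Check which accelerator PyTorch can see; "mps" is the Metal backend on
# Apple Silicon, "cuda" is what you'd get on a Kaggle or Colab GPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Generating on: {device}")
```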
Right now it takes about three minutes to generate an image on my M2 MacBook Air, which is pretty slow. I ended up using Google Colab, which lets you use their GPUs for free, but the next day I found I was rate limited. So I moved over to Kaggle, which at least tells you what your GPU limits are, and they seem pretty reasonable (around 30 hours a week). The Kaggle notebook has been my main workflow so far because it’s quite fast (roughly ten times faster than my Mac), and the short feedback loop really helps when iterating on prompts.
Prompt Engineering
So what is a prompt, anyway? A prompt is the text you type into the image generator to describe what you want it to create. If you ask for a picture of a cat, it’ll likely come up with a pretty good cat. But you can also describe the cat in more detail to get a more specific image: add the breed, say it’s a robot cat, or have it sit on a park bench. There are so many possible cats it could show you that the prompt ends up being incredibly important for getting what you want.
There’s been a lot of research into prompt engineering. Some of it feels like a shortcut, like asking for an image in the style of a famous artist. You can also just describe the medium, like a watercolor or an illustration. I’ve noticed people on Reddit adding phrases like “trending on artstation”, which I guess nudges the model toward images that most people would find aesthetically pleasing. You can also say something was painted badly, which is kind of hilarious.
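If you want to compare modifiers systematically, the easiest thing is to tack them onto a fixed base prompt and generate one image per variant. The subject and modifiers below are just made-up examples:

```python
# Hypothetical base prompt plus a few style modifiers to compare.
base = "a portrait of a corgi sitting on a park bench"
modifiers = [
    "oil painting in the style of Rembrandt",
    "watercolor illustration",
    "digital art, trending on artstation",
    "badly drawn with crayon",
]

# Build one prompt per modifier; feed each of these to the generator.
prompts = [f"{base}, {m}" for m in modifiers]
for p in prompts:
    print(p)
```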
Part of the fun of playing around with these tools is that they’re so new that you can still find words that produce images no one else knows about. For me, it brings back the feeling of the early internet, before everything was indexed by Google and there were still hidden gems to find. Someone should bring back “Cool Site of the Day”, but for prompts!
Different Techniques
I’ve learned that there are a few different techniques for making images, and I’m picking up more all the time. The simplest one is text-to-image, which is what I just described. The way I understand it is that you start with random noise, and a denoising model gradually alters that noise over many iterations, steered by a text encoder’s reading of your prompt, until the result matches your description. That’s probably really simplified and maybe still a bit wrong, but whatever, I’m not an AI engineer.
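For reference, here’s roughly what the simplest text-to-image call looks like with the diffusers library (the checkpoint name is just one of the public v1 models; swap in whichever one you downloaded):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the Stable Diffusion weights; this is one of the public v1 checkpoints.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")  # use "mps" on Apple Silicon (and drop float16 if it complains)

image = pipe(
    "a watercolor illustration of a corgi in a flowery meadow",
    num_inference_steps=50,  # how many denoising iterations to run
    guidance_scale=7.5,      # how strongly to steer the image toward the prompt
).images[0]

image.save("corgi.png")
```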
Another technique, usually called image-to-image, is to start with a base image and feed the AI a prompt along with it. Since you choose the starting image (instead of just random noise), the output is much more likely to end up resembling your input. This gives you quite a bit more control over the final image’s general shape, composition, and so on. You could feed it a stick-figure drawing or a photo from your phone; I’ve seen people turn stick drawings into D&D character portraits this way.
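Here’s a sketch of the image-to-image version with diffusers, assuming a 512x512 starting photo (the filename and prompt are placeholders):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# Start from your own photo instead of random noise; the filename is made up.
init_image = Image.open("sodapop.jpg").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a watercolor illustration of a black and white cardigan corgi "
           "sitting in a green flowery meadow",
    image=init_image,
    strength=0.6,       # how much the model may change the photo (0 = barely, 1 = completely)
    guidance_scale=7.5,
).images[0]

result.save("sodapop_watercolor.png")
```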
I tried this technique out using a photo of my dog, Sodapop, sitting in the grass. The photo is pretty good, but it’s not award-winning or anything. I fed in the prompt “a watercolor illustration of a black and white cardigan corgi sitting in the middle of a green flowery meadow in front of an orange ball, masterpiece, big cute eyes”. I didn’t start with that exact wording; I kept changing it to try to get an image I wanted.
I also played around with different strengths and numbers of iterations. I found that if I used too many iterations, the image didn’t really resemble Sodapop anymore. He’s a black and white Corgi, which is less common, so the model is probably biased towards the sable and white ones. One thing I learned is that it’s better to just generate a huge number of images and then pick the ones you like. You can also save the random seed value and reuse it to refine an image further. My computer is now full of really terrible-looking corgi watercolors, but there were some fairly good ones too! The real power of this approach is that it’s cheap to just keep making images until you get what you want.
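The seed-plus-batch workflow looks roughly like this with diffusers; fixing the generator’s seed makes a run repeatable, so the only thing that changes between runs is whatever you tweak in the prompt:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a watercolor illustration of a black and white cardigan corgi, big cute eyes"

# A fixed seed makes the run reproducible, so you can iterate on the prompt
# while keeping everything else the same.
generator = torch.Generator("cuda").manual_seed(1234)

# Generate a small batch and keep whichever images you like.
images = pipe(
    prompt,
    num_images_per_prompt=4,
    generator=generator,
).images

for i, img in enumerate(images):
    img.save(f"corgi_{i}.png")
```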
Future Techniques
There’s another technique I tried recently for creating bigger images (right now most video cards can only handle 512x512, which is also what the model was trained on): generate an image, upscale it, and then run the image-to-image process on nine square tiles of the upscaled version. When I tried it, I found it added weird artifacts to each tile; it was essentially trying to fit the whole prompt into every square. My prompt was a garden, and it basically tried to cram a whole garden into each subsquare of the image. This could have been due to a current bug where the random seed doesn’t really work on Macs, but I don’t have the hardware to try it on anything else right now.
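For reference, the tiled upscale idea looks roughly like this (a sketch of the general approach, not the exact script I ran; the filename and prompt are placeholders):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a lush garden, watercolor"
tile = 512

# Upscale a 512x512 generation to 1536x1536 (a plain resize here; a proper
# upscaler would do better), then re-detail it one tile at a time.
big = Image.open("garden.png").resize((tile * 3, tile * 3), Image.LANCZOS)
out = big.copy()

for y in range(0, big.height, tile):
    for x in range(0, big.width, tile):
        patch = big.crop((x, y, x + tile, y + tile))
        refined = pipe(
            prompt=prompt,
            image=patch,
            strength=0.3,     # low strength so each tile keeps its original content
            guidance_scale=7.5,
        ).images[0]
        out.paste(refined, (x, y))

out.save("garden_big.png")
```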
I’ve had more fun playing around with this image generation stuff than I’ve had with any technology in a long time, so I ordered a new graphics card to iterate on things more quickly on my own hardware. There’s something really magical about a model file that’s only a few gigabytes being able to create basically any image you can think of. If my internet connection ever goes down for the count, this could be my main source of entertainment.
There are a bunch of other things I want to try. One is a technique called “textual inversion”, where you can sort of re-train (but not really) the model to understand a personalized word. I could do this with Sodapop so I stop getting Pembroke Corgis when I want my Corgi. I was also wondering if I could use it with pictures of myself, since Stable Diffusion seems to be really good at generating images of well-known celebrities.
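If I get an embedding trained, using it should look something like this with diffusers’ textual-inversion loader (the embedding path and the <sodapop> token are placeholders for whatever you train):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# Load a learned embedding (trained separately on a handful of photos of the
# dog); the path and the <sodapop> token are placeholders.
pipe.load_textual_inversion("./sodapop_embedding", token="<sodapop>")

image = pipe(
    "a watercolor illustration of <sodapop> sitting in a flowery meadow",
    guidance_scale=7.5,
).images[0]

image.save("sodapop_inversion.png")
```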
When I first saw this technology, I figured it would be good for creating blog post images (which is obviously what I did for this post). I’m also envisioning things like services that create customized watercolor portraits of your dog, or custom fantasy avatars of a person. I think people have barely scratched the surface here, so hopefully there’s a lot more interesting stuff coming.