Introduction to Stable Diffusion and Its Applications

Stable Diffusion is a deep learning, text-to-image model released in 2022. It is used primarily to generate detailed images from text descriptions, but it can also be applied to other tasks such as inpainting, outpainting, and text-guided image-to-image translation. It was developed by the start-up Stability AI in collaboration with a number of academic researchers and non-profit organizations. The model's code and weights have been released publicly, making it accessible to anyone with a modest GPU that has at least 8 GB of VRAM.
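As a concrete illustration of this accessibility, here is a minimal text-to-image sketch using the Hugging Face diffusers library, which wraps the publicly released weights. The library, the checkpoint identifier, and the settings shown are assumptions of this example rather than part of the original release.

```python
# Minimal text-to-image sketch using the Hugging Face diffusers library.
# Assumes diffusers, transformers, and torch are installed and that a CUDA
# GPU with roughly 8 GB of VRAM is available.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed public checkpoint; others exist
    torch_dtype=torch.float16,         # half precision to fit in ~8 GB VRAM
)
pipe = pipe.to("cuda")

# Generate one image from a text prompt and save it.
image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
image.save("lighthouse.png")
```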

The development of Stable Diffusion was led by Patrick Esser of Runway and Robin Rombach of CompVis. The project was supported by EleutherAI and LAION, a German nonprofit that assembled the dataset on which Stable Diffusion was trained. Stability AI raised US$101 million in a funding round in October 2022, led by Lightspeed Venture Partners and Coatue Management.

Stable Diffusion is a latent diffusion model (LDM), a type of diffusion model (DM) developed by the CompVis group at LMU Munich. The model is trained by progressively adding Gaussian noise to training images and learning to remove it, a process that can be thought of as a sequence of denoising autoencoders. It consists of three parts: a variational autoencoder (VAE), a U-Net, and an optional text encoder. The VAE encoder compresses an image from pixel space into a lower-dimensional latent space, and Gaussian noise is iteratively applied to this latent representation during forward diffusion. The U-Net then reverses the forward diffusion, iteratively denoising the noisy latents to recover a clean latent representation, which the VAE decoder converts back into pixel space to produce the final image. The denoising step can be flexibly conditioned on a string of text, an image, or another modality, with the encoded conditioning data exposed to the denoising U-Net via a cross-attention mechanism. For conditioning on text, the pretrained CLIP ViT-L/14 text encoder transforms prompts into an embedding space.
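The division of labor among the three components can be sketched in code. The following is a minimal illustration of one forward-noising and one denoising step, using the diffusers and transformers libraries; the checkpoint name, prompt, and single-step usage are assumptions of this sketch, and a real sampling loop would iterate over many timesteps under a scheduler.

```python
# Sketch of one forward-noising and one denoising step through the three
# Stable Diffusion components (VAE, U-Net, CLIP text encoder).
# The checkpoint name below is an assumed public release.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

repo = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 1. Text conditioning: CLIP ViT-L/14 maps the prompt to an embedding.
ids = tokenizer("a red bicycle", padding="max_length",
                max_length=tokenizer.model_max_length,
                return_tensors="pt").input_ids
text_emb = text_encoder(ids)[0]  # shape (1, 77, 768)

# 2. VAE encoder: compress a 512x512 RGB image into the latent space.
image = torch.randn(1, 3, 512, 512)  # stand-in for a real training image
latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

# 3. Forward diffusion: add Gaussian noise at a sampled timestep.
noise = torch.randn_like(latents)
t = torch.tensor([500])
noisy = scheduler.add_noise(latents, noise, t)

# 4. U-Net: predict the noise, conditioned on the text embedding via
# cross-attention.
pred = unet(noisy, t, encoder_hidden_states=text_emb).sample

# At inference time the U-Net runs for many such steps, and vae.decode()
# finally maps the denoised latents back to pixel space.
```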

The model was trained on the LAION-5B dataset, which consists of image-caption pairs taken from Common Crawl data scraped from the web. The dataset contains 5 billion image-text pairs, classified by language and filtered into subsets by resolution, predicted likelihood of containing a watermark, and predicted "aesthetic" score.
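The kind of metadata-based filtering described above can be sketched as follows. The field names and thresholds here are illustrative assumptions about per-sample metadata, not a reproduction of the actual LAION or Stability AI pipeline.

```python
# Illustrative sketch of metadata-based filtering of image-text pairs.
# Field names ("pwatermark", "aesthetic", etc.) and thresholds are
# assumptions for illustration only.
def keep(sample: dict) -> bool:
    return (
        sample["width"] >= 512 and sample["height"] >= 512  # resolution cut
        and sample["pwatermark"] < 0.5   # predicted watermark probability
        and sample["aesthetic"] >= 5.0   # predicted aesthetic score
    )

pairs = [
    {"url": "https://example.com/a.jpg", "caption": "a cat on a sofa",
     "width": 640, "height": 640, "pwatermark": 0.1, "aesthetic": 6.2},
    {"url": "https://example.com/b.jpg", "caption": "stock photo",
     "width": 300, "height": 200, "pwatermark": 0.9, "aesthetic": 4.1},
]
subset = [p for p in pairs if keep(p)]  # keeps only the first pair
```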

Diffusion models like Stable Diffusion have emerged as a leading family of deep generative models, surpassing the previous dominance of generative adversarial networks (GANs) in image synthesis. They have also shown promise in other domains, including broader computer vision tasks and natural language processing.