Project 5: Fun With Diffusion Models!
Tej Bade, tbade12@berkeley.edu
Part A: The Power of Diffusion Models
0. Setup
In this project we use the two-stage DeepFloyd IF diffusion model. We feed in three text prompts and generate an image for each. I used random seed 1234 (the seed for the remainder of the project) and set the number of inference steps to 20. The results are shown below.
The results are of good quality and fit the prompts well. I then changed the number of inference steps to 5; the images sampled with num_inference_steps=5 are shown below.
Decreasing the number of inference steps to 5 makes the outputs noticeably lower quality and noisier.
1. Sampling Loops
For this section we use the following test image.
1.1 Implementing the forward process
We first implement the forward process, which takes a clean image and adds noise to it. Given a clean image $x_0$ and a timestep $t$, we can generate a noisy image $x_t$ with the equation

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

for precomputed $\bar\alpha_t$ ("alpha_bar") values. The results on the test image for t = 250, 500, 750 are shown below.
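As a rough sketch of the forward process (assuming images are torch tensors in [0, 1] and `alphas_cumprod` is the precomputed cumulative-product schedule from the DeepFloyd stage-1 scheduler):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image im at timestep t.

    Implements x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps with
    eps ~ N(0, I) and abar_t = alphas_cumprod[t]. Returns the noisy
    image and the noise that was used.
    """
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps, eps
```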
1.2 Classical Denoising
One classical method for denoising the images is a Gaussian blur. For t = 250, 500, 750 we use (5, 1), (7, 1), and (7, 2) as our (kernel size, sigma) values, respectively.
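A minimal sketch of this baseline using torchvision's `gaussian_blur` (the `im_250` etc. names are hypothetical noisy images from Section 1.1):

```python
import torchvision.transforms.functional as TF

# (kernel size, sigma) chosen per noise level
blurred_250 = TF.gaussian_blur(im_250, kernel_size=5, sigma=1.0)
blurred_500 = TF.gaussian_blur(im_500, kernel_size=7, sigma=1.0)
blurred_750 = TF.gaussian_blur(im_750, kernel_size=7, sigma=2.0)
```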
1.3 Implementing One Step Denoising
We can use the equation from Section 1.1 to solve for $x_0$ given $x_t$ and the model's noise estimate $\epsilon$:

$$x_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t}\,\epsilon}{\sqrt{\bar\alpha_t}}$$

This is called one-step denoising. It gives us the following denoised images for t = 250, 500, 750.
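A sketch of one-step denoising, assuming `unet` is the DeepFloyd stage-1 UNet (whose output's first three channels are the noise estimate) and `prompt_embeds` are the text embeddings:

```python
import torch

def one_step_denoise(x_t, t, unet, alphas_cumprod, prompt_embeds):
    """Estimate the clean image x_0 from x_t in a single step by
    predicting the noise and inverting the forward-process equation."""
    abar_t = alphas_cumprod[t]
    with torch.no_grad():
        out = unet(x_t.unsqueeze(0), t, encoder_hidden_states=prompt_embeds).sample
    eps = out.squeeze(0)[:3]  # keep the noise channels, drop predicted variance
    return (x_t - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
```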
1.4 Implementing Iterative Denoising
Instead of denoising all at once, we can denoise iteratively: starting at the noisiest timestep and stepping through a strided list of timesteps, we move from the image at timestep $t$ to the image at the next, less noisy timestep $t'$ with

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\, x_t + v_\sigma$$

where $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$, $\beta_t = 1 - \alpha_t$, $x_0$ is the current one-step estimate of the clean image, and $v_\sigma$ is random noise added back at each step.
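A sketch of the loop under the same assumptions as the previous sketches; `timesteps` is a decreasing strided list (e.g. 990, 960, ..., 0), and the $v_\sigma$ term is omitted for brevity:

```python
import torch

def iterative_denoise(x, timesteps, unet, alphas_cumprod, prompt_embeds):
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha_t = abar_t / abar_next
        beta_t = 1 - alpha_t
        with torch.no_grad():
            out = unet(x.unsqueeze(0), t, encoder_hidden_states=prompt_embeds).sample
        eps = out.squeeze(0)[:3]
        # One-step estimate of the clean image at the current timestep.
        x0 = (x - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
        # Blend the clean estimate with the current noisy image.
        x = (torch.sqrt(abar_next) * beta_t / (1 - abar_t)) * x0 \
            + (torch.sqrt(alpha_t) * (1 - abar_next) / (1 - abar_t)) * x
    return x
```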
Using this algorithm, we get the following results, shown for every fifth loop iteration of the denoising process. At the end is the final predicted image from this method alongside the results of the two previous methods.
1.5 Diffusion Model Sampling
We can use the iterative denoising method from Section 1.4 to generate images from scratch by passing pure random noise into it. 5 samples are shown below.
1.6 Classifier-Free Guidance
To improve image quality, we can pass in the embeddings for the prompt “a high quality photo”. We compute noise estimates both conditioned on the prompt ($\epsilon_c$) and unconditioned ($\epsilon_u$), then combine them into a final noise estimate with the equation below (gamma is set to 7):

$$\epsilon = \epsilon_u + \gamma\,(\epsilon_c - \epsilon_u)$$
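A sketch of the CFG noise estimate, with `cond_embeds`/`uncond_embeds` the conditional and null-prompt embeddings (names are assumptions):

```python
import torch

def cfg_noise_estimate(x, t, unet, cond_embeds, uncond_embeds, gamma=7.0):
    """eps = eps_u + gamma * (eps_c - eps_u); gamma > 1 pushes the
    sample toward the conditional estimate."""
    with torch.no_grad():
        eps_c = unet(x.unsqueeze(0), t, encoder_hidden_states=cond_embeds).sample
        eps_u = unet(x.unsqueeze(0), t, encoder_hidden_states=uncond_embeds).sample
    eps_c, eps_u = eps_c.squeeze(0)[:3], eps_u.squeeze(0)[:3]
    return eps_u + gamma * (eps_c - eps_u)
```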
This method gave me the following 5 sampled images, which are indeed higher-quality photos.
1.7 Image-to-Image Translation
In this part, we take an image, noise it a little, and run the classifier-free guidance method from above with the “a high quality photo” prompt, using starting indices of [1, 3, 5, 7, 10, 20]. This effectively creates a series of images that gradually approaches the original.
I also found two other images from the web to test with.
The same process above was applied to these images.
1.7.1 Editing Hand-Drawn and Web Images
We can apply the process above to nonrealistic images too. Shown below are the edits for a web image.
I also created the following hand-drawn images.
1.7.2 Inpainting
To implement inpainting, we create a binary mask $\mathbf{m}$ and, at every timestep, replace $x_t$ with

$$x_t \leftarrow \mathbf{m}\, x_t + (1 - \mathbf{m})\,\text{forward}(x_{\text{orig}}, t)$$

so that new content is generated only inside the mask, while everything outside it is forced back to the appropriately noised original image.
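As a sketch, this is the per-timestep replacement inside the denoising loop, where `m` is 1 in the region to regenerate and 0 elsewhere, and `forward()` is the noising function sketched in Section 1.1:

```python
def inpaint_step(x, m, x_orig, t, alphas_cumprod):
    """Force pixels outside the mask back to the (noised) original
    image after each denoising step."""
    x_orig_noised, _ = forward(x_orig, t, alphas_cumprod)
    return m * x + (1 - m) * x_orig_noised
```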
The masks are shown on the left and the inpainted images are shown on the right.
1.7.3 Text-Conditioned Image-to-Image Translation
In this section we do the same thing as in 1.7.1 but change the conditional prompt to “a rocket ship”. This causes the edits to gradually approach the original image while also trying to incorporate a rocket ship.
Shown below is the same process with the rocket ship prompt applied to my two web images.
1.8 Visual Anagrams
The following anagrams were created using two conditional prompts, with one prompt's noise estimate computed on the image flipped upside down; the two estimates are then averaged (a sketch appears after the list below). I created three anagrams.
- “an oil painting of people around a campfire” and “an oil painting of an old man”
- “a photo of the Amalfi coast” and “a photo of a man”
- “a photo of a hipster barista” and “a photo of a dog”
1.9 Hybrid Images
Finally, we can create hybrid images, which show one prompt close up and another from farther away. This is done by combining the low frequencies of one prompt's noise estimate with the high frequencies of the other's. The captions for the images show the “farther away” prompt first and the “close up” prompt second.
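A sketch of the hybrid noise estimate under the same assumptions, using a Gaussian blur as the low-pass filter (the kernel size and sigma here are assumptions):

```python
import torch
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x, t, unet, embeds_far, embeds_near, uncond_embeds, gamma=7.0):
    """Low frequencies come from the 'far away' prompt, high frequencies
    from the 'close up' prompt."""
    eps_far = cfg_noise_estimate(x, t, unet, embeds_far, uncond_embeds, gamma)
    eps_near = cfg_noise_estimate(x, t, unet, embeds_near, uncond_embeds, gamma)
    low = TF.gaussian_blur(eps_far, kernel_size=33, sigma=2.0)
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=33, sigma=2.0)
    return low + high
```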
Part B: Diffusion Models from Scratch
Part 1: Training a Single-Step Denoising UNet
In this section, we build an unconditional UNet from scratch. We use the MNIST dataset and generate training pairs $(z, x)$ by noising clean images $x$ with

$$z = x + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
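Generating a training pair is a one-liner (sketch):

```python
import torch

def add_noise(x, sigma):
    """Return the noised input z = x + sigma * eps, eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```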
Shown below are a few MNIST digits with differing levels of noise.
We then train the UNet to map $z$ back to $x$ with an L2 loss, using an Adam optimizer with learning rate 1e-4 and sigma = 0.5 for the noising. A minimal sketch of the loop is shown below; the plot that follows shows the loss at each training step.
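In this sketch, `UnconditionalUNet` stands in for the from-scratch model; any module mapping (B, 1, 28, 28) to (B, 1, 28, 28) works:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

model = UnconditionalUNet().cuda()  # hypothetical from-scratch UNet
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = DataLoader(
    datasets.MNIST(".", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=256, shuffle=True)

for epoch in range(5):
    for x, _ in loader:
        x = x.cuda()
        z = add_noise(x, sigma=0.5)     # noising function from the sketch above
        loss = F.mse_loss(model(z), x)  # L2 denoising loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```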
I checkpointed the model during the first and fifth epochs so we can see how well the model denoises at these stages. I visualized inputs, noised images, and denoised images for epochs 1 and 5.
Because we trained with sigma = 0.5, we can test how well the model performs on images noised with other sigma values (out-of-distribution testing).
Part 2: Training a Diffusion Model
Adding Time-Conditioning to UNet
We now train a UNet to iteratively denoise an image by injecting a scalar t into our neural network pipeline. Using an Adam optimizer with an initial learning rate of 1e-3 and an exponential learning rate decay scheduler, we train this new model on our dataset. Shown below are the losses for each step.
We can now use the model to sample images by passing in random noise. I sampled 40 images using the model state from epochs 5 and 20 of the training process.
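A sketch of the sampling loop, assuming the model predicts the noise and takes a normalized timestep, and that `betas` is the (T,) variance schedule (all names and the `model(x, t)` signature are assumptions):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, betas, n=40, T=300):
    alphas = 1 - betas
    abar = torch.cumprod(alphas, dim=0)
    x = torch.randn(n, 1, 28, 28)                  # start from pure noise
    for t in range(T - 1, -1, -1):
        abar_prev = abar[t - 1] if t > 0 else torch.tensor(1.0)
        eps = model(x, torch.full((n, 1), t / T))  # predicted noise
        x0 = (x - torch.sqrt(1 - abar[t]) * eps) / torch.sqrt(abar[t])
        z = torch.randn_like(x) if t > 0 else 0.0  # no noise on the last step
        x = (torch.sqrt(abar_prev) * betas[t] / (1 - abar[t])) * x0 \
            + (torch.sqrt(alphas[t]) * (1 - abar_prev) / (1 - abar[t])) * x \
            + torch.sqrt(betas[t]) * z
    return x
```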
Adding Class-Conditioning to UNet
Since MNIST has 10 digit classes, we can extend the previous model to condition on the class as well. During training, the class conditioning is dropped some fraction of the time so the model also learns an unconditional estimate (needed for classifier-free guidance at sampling time). Doing this yields the following plot of loss vs. training step.
Our sampling method uses classifier-free guidance to sample images from a particular class. For each digit from 0 to 9, I sampled four images; the results for epochs 5 and 20 are displayed below.
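A sketch of class-conditional sampling with CFG, assuming a model signature `model(x, t, c)` where `c` is a one-hot class vector and a zero vector yields the unconditional estimate (matching the conditioning dropout used in training); the guidance scale here is an assumption:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_digit(model, betas, digit, n=4, T=300, gamma=5.0):
    alphas = 1 - betas
    abar = torch.cumprod(alphas, dim=0)
    c = F.one_hot(torch.full((n,), digit), num_classes=10).float()
    x = torch.randn(n, 1, 28, 28)
    for t in range(T - 1, -1, -1):
        abar_prev = abar[t - 1] if t > 0 else torch.tensor(1.0)
        t_in = torch.full((n, 1), t / T)
        eps_c = model(x, t_in, c)                    # class-conditional estimate
        eps_u = model(x, t_in, torch.zeros_like(c))  # unconditional estimate
        eps = eps_u + gamma * (eps_c - eps_u)        # classifier-free guidance
        x0 = (x - torch.sqrt(1 - abar[t]) * eps) / torch.sqrt(abar[t])
        z = torch.randn_like(x) if t > 0 else 0.0
        x = (torch.sqrt(abar_prev) * betas[t] / (1 - abar[t])) * x0 \
            + (torch.sqrt(alphas[t]) * (1 - abar_prev) / (1 - abar[t])) * x \
            + torch.sqrt(betas[t]) * z
    return x
```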