Project 5: Fun With Diffusion Models!

Tej Bade, tbade12@berkeley.edu

Part A: The Power of Diffusion Models

0. Setup

In this project we use the DeepFloyd IF diffusion model, a two-stage model. We feed in three text prompts and generate an image for each. I used random seed 1234 (the seed used for the remainder of the project) and set the number of inference steps to 20. The results are shown below.

The results are of good quality and fit their prompts well. I then reduced the number of inference steps to 5; the resulting samples with num_inference_steps=5 are shown below.

Decreasing the number of inference steps to 5 noticeably lowers output quality and leaves more visible noise.

1. Sampling Loops

For this section we use the following test image.

1.1 Implementing the forward process

We first implement the forward process, which takes a clean image and adds noise to it. Given a clean image $x_0$ and a timestep $t$, we generate a noisy image $x_t$ with

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

for precomputed $\bar\alpha_t$ (“alpha_bar”) values. The results for the test image with t = 250, 500, 750 are shown below.
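As a concrete sketch (not the project's exact code), the forward process takes only a few lines of PyTorch. Here `alphas_cumprod` stands in for the model's precomputed alpha_bar schedule; the names and shapes are illustrative assumptions.

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t (hypothetical helper).

    x0: clean image, shape (B, C, H, W); t: integer timestep;
    alphas_cumprod: precomputed alpha_bar schedule, shape (T,).
    """
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # epsilon ~ N(0, I)
    xt = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps
    return xt, eps
```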

1.2 Classical Denoising

One classical method for denoising is to apply a Gaussian blur. For t = 250, 500, 750 we use (5, 1), (7, 1), and (7, 2) as our (kernel size, sigma) values, respectively.
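A minimal sketch of this baseline using torchvision's gaussian_blur, with the per-timestep parameters listed above:

```python
from torchvision.transforms.functional import gaussian_blur

# (kernel size, sigma) chosen per noise level, as listed above
blur_params = {250: (5, 1.0), 500: (7, 1.0), 750: (7, 2.0)}

def classical_denoise(xt, t):
    k, sigma = blur_params[t]
    return gaussian_blur(xt, kernel_size=k, sigma=sigma)
```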

1.3 Implementing One-Step Denoising

We can rearrange the equation in section 1.1 to solve for $x_0$ given $x_t$ and $t$:

$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t}\,\hat\epsilon}{\sqrt{\bar\alpha_t}},$$

where $\hat\epsilon$ is the UNet's noise estimate. This is called one-step denoising and gives us the following denoised images for t = 250, 500, 750.
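A sketch of one-step denoising, assuming a `unet(xt, t)` call that returns the noise estimate (the real DeepFloyd stage-1 call also takes prompt embeddings):

```python
def one_step_denoise(unet, xt, t, alphas_cumprod):
    """Estimate the clean image from x_t in a single step."""
    alpha_bar = alphas_cumprod[t]
    eps_hat = unet(xt, t)  # noise estimate (stand-in for the real model call)
    return (xt - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()
```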

1.4 Implementing Iterative Denoising

Instead of denoising all at once, we can iteratively denoise, stepping from a timestep $t$ down to an earlier timestep $t'$ with

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\,\hat{x}_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\,x_t + v_\sigma,$$

where $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$, $\beta_t = 1 - \alpha_t$, $\hat{x}_0$ is the current one-step estimate of the clean image, and $v_\sigma$ is random noise.

Using this algorithm, we get the following results, showing every fifth step of the denoising process. At the end is the final predicted image from this method alongside the results of the two previous methods.
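A sketch of the iterative loop under the same assumptions as the earlier snippets; the variance term $v_\sigma$ is approximated here by $\sqrt{\beta_t}$ times fresh noise, a simplification of the model's predicted variance:

```python
import torch

def iterative_denoise(unet, x, timesteps, alphas_cumprod):
    """Denoise along a strided, decreasing schedule of timesteps."""
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        ab_t, ab_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = ab_t / ab_prev
        beta = 1 - alpha
        eps_hat = unet(x, t)  # noise estimate (stand-in call)
        x0_hat = (x - (1 - ab_t).sqrt() * eps_hat) / ab_t.sqrt()
        noise = torch.randn_like(x) if t_prev > 0 else 0.0
        x = (ab_prev.sqrt() * beta / (1 - ab_t)) * x0_hat \
            + (alpha.sqrt() * (1 - ab_prev) / (1 - ab_t)) * x \
            + beta.sqrt() * noise  # simplified v_sigma term
    return x
```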

1.5 Diffusion Model Sampling

We can use the iterative denoising method from section 1.4 to generate images from scratch by passing in pure random noise. Five samples are shown below.

1.6 Classifier-Free Guidance

In order to improve image quality, we can pass in the embeddings for the prompt “a high quality photo”. We compute a conditional noise estimate $\epsilon_c$ (conditioned on the prompt) and an unconditional estimate $\epsilon_u$, then combine them with

$$\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u),$$

where the guidance scale $\gamma$ is set to 7.
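As a sketch, with `unet(x, t, emb)` standing in for the prompt-conditioned model call:

```python
def cfg_noise_estimate(unet, xt, t, cond_emb, uncond_emb, gamma=7.0):
    """Classifier-free guidance: blend conditional and unconditional estimates."""
    eps_c = unet(xt, t, cond_emb)    # conditioned on "a high quality photo"
    eps_u = unet(xt, t, uncond_emb)  # conditioned on the null ("") prompt
    return eps_u + gamma * (eps_c - eps_u)
```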

This method gave me the following five sampled images, which are indeed higher-quality photos.

1.7 Image-to-Image Translation

In this part, we take an image, add a little noise to it, and run the classifier-free guidance method from above with the “a high quality photo” prompt, using starting indices of [1, 3, 5, 7, 10, 20]. This produces a series of images that gradually becomes closer to the original.
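Conceptually, this is just the two earlier sketches chained together (names are reused from them, and `iterative_denoise` is assumed here to use the CFG noise estimate internally):

```python
def edit_image(unet, x0, i_start, timesteps, alphas_cumprod):
    """Noise a clean image to timesteps[i_start], then denoise it back."""
    t = timesteps[i_start]
    xt, _ = forward(x0, t, alphas_cumprod)  # partially noise the input
    return iterative_denoise(unet, xt, timesteps[i_start:], alphas_cumprod)
```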

I also found two other images from the web to test with.

The same process above was applied to these images.

1.7.1 Editing Hand-Drawn and Web Images

We can apply the process above to nonrealistic images too. Shown below are the edits for a web image.

Original
“Edited” images

I also created the following hand-drawn images.

Original
“Edited” images
Original
“Edited” images

1.7.2 Inpainting

To implement inpainting, we create a binary mask $m$ and, at every timestep, replace $x_t$ with

$$x_t \leftarrow m\,x_t + (1 - m)\,\text{forward}(x_{orig}, t),$$

so pixels where $m = 0$ are forced back to the (appropriately noised) original image and new content is generated only where $m = 1$.
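A sketch of the per-timestep replacement, reusing the hypothetical forward() helper from section 1.1:

```python
def inpaint_step(xt, x_orig, mask, t, alphas_cumprod):
    """Force known pixels back to the (noised) original after each update.

    mask: 1 where new content is generated, 0 where the original is kept.
    """
    x_orig_t, _ = forward(x_orig, t, alphas_cumprod)  # original at noise level t
    return mask * xt + (1 - mask) * x_orig_t
```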

The masks are shown on the left and the inpainted images are shown on the right.

1.7.3 Text-Conditioned Image-to-Image Translation

In this section we repeat the process from section 1.7 but change the conditional prompt to “a rocket ship”. The edits gradually approach the original image while also trying to incorporate a rocket ship.

Shown below is the same process with the rocket ship prompt applied to my two web images.

1.8 Visual Anagrams

The following anagrams were created using two conditional prompts: one prompt's noise estimate is computed on the image as-is, and the other's on the image flipped upside down (no extra training is involved). Concretely, $\epsilon_1 = \text{UNet}(x_t, t, p_1)$, $\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$, and the final estimate is the average $\epsilon = (\epsilon_1 + \epsilon_2)/2$; a code sketch follows the list. I created three anagrams.

  1. “an oil painting of people around a campfire” and “an oil painting of an old man”
  2. “a photo of the amalfi coast” and “a photo of a man”
  3. “a photo of a hipster barista” and “a photo of a dog”

1.9 Hybrid Images

Finally, we can create hybrid images, which show one prompt up close and another from farther away. To do this, we combine the low frequencies of one prompt's noise estimate with the high frequencies of the other's: $\epsilon = \text{lowpass}(\epsilon_1) + \text{highpass}(\epsilon_2)$. The captions for the images show the “farther away” prompt first and the “close up” prompt second.

“a lithograph of a skull” and “a lithograph of waterfalls”
"an oil painting of a snowy mountain village” and "an oil painting of people around a campfire”
“a rocket ship” and “a pencil”
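A sketch of the frequency split, using a Gaussian blur as the low-pass filter (the kernel size and sigma here are illustrative assumptions):

```python
from torchvision.transforms.functional import gaussian_blur

def hybrid_noise_estimate(unet, xt, t, emb_far, emb_near, k=33, sigma=2.0):
    """Low frequencies from the 'far' prompt, high frequencies from the 'near' one."""
    eps_far = unet(xt, t, emb_far)
    eps_near = unet(xt, t, emb_near)
    low = gaussian_blur(eps_far, kernel_size=k, sigma=sigma)
    high = eps_near - gaussian_blur(eps_near, kernel_size=k, sigma=sigma)
    return low + high
```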

Part B: Diffusion Models from Scratch

Part 1: Training a Single-Step Denoising UNet

In this section, we build an unconditional UNet from scratch. We use the MNIST dataset and generate training data pairs by noising clean images $x$ with

$$z = x + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

Shown below are a few MNIST digits with differing levels of noise.

We then train the UNet on the dataset using an Adam optimizer with learning rate 1e-4; the training data is noised with $\sigma = 0.5$. The following plot shows the loss at each step of training.
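The training loop is a standard regression setup; this sketch assumes `model`, `train_loader`, and `num_epochs` are defined elsewhere:

```python
import torch

sigma = 0.5
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.MSELoss()

for epoch in range(num_epochs):
    for x, _ in train_loader:                # MNIST images; labels unused here
        z = x + sigma * torch.randn_like(x)  # noisy training input
        loss = criterion(model(z), x)        # L2 loss against the clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```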

I checkpointed the model during the first and fifth epochs so we can see how well the model denoises at these stages. I visualized inputs, noised images, and denoised images for epochs 1 and 5.

Since the model was trained only with $\sigma = 0.5$, we can test how well it performs on images noised with other sigma values (out-of-distribution testing).

Part 2: Training a Diffusion Model

Adding Time-Conditioning to UNet

We now train a UNet to iteratively denoise an image by injecting a scalar t into our neural network pipeline. Using an Adam optimizer with an initial learning rate of 1e-3 and an exponential learning rate decay scheduler, we train this new model on our dataset. Shown below are the losses for each step.
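One common way to inject the scalar is a small MLP whose output is added to intermediate feature maps; the sketch below (class name, layer sizes, and injection points are my assumptions, not the exact architecture) illustrates the idea:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps a normalized scalar t in [0, 1] to a per-channel bias."""
    def __init__(self, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, t):                     # t: shape (B, 1)
        return self.net(t)[:, :, None, None]  # broadcast over H and W

# conceptually, inside the UNet forward pass:
#   unflatten = unflatten + self.t_block1(t)
#   up1       = up1 + self.t_block2(t)
```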

We can now use the model to sample images by passing in pure random noise. I sampled 40 images each with the model checkpoints from epochs 5 and 20 of training.

Adding Class-Conditioning to UNet

Since MNIST has 10 digits (classes), we can extend the previous model to also condition on the class label. Doing this yields the following plot of loss vs. training step.

Our sampling method employs classifier-free guidance to sample images of a particular class. For each digit from 0 to 9, I sampled four images; the results for epochs 5 and 20 are displayed below.
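A sketch of class-conditional sampling with classifier-free guidance; the guidance scale and the `ddpm_update` helper (one iterative-denoising step, as in Part A) are assumptions:

```python
import torch
import torch.nn.functional as F

def sample_digit(model, digit, timesteps, guidance=5.0):
    """Sample one 28x28 image of `digit` starting from pure noise."""
    x = torch.randn(1, 1, 28, 28)
    c = F.one_hot(torch.tensor([digit]), 10).float()  # one-hot class vector
    null = torch.zeros_like(c)                        # "no class" input
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        eps_u = model(x, t, null)                     # unconditional estimate
        eps_c = model(x, t, c)                        # class-conditional estimate
        eps = eps_u + guidance * (eps_c - eps_u)
        x = ddpm_update(x, eps, t, t_prev)            # hypothetical DDPM step
    return x
```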