Name: Brandon Wong
The purpose of Project 5A is to get familiar with diffusion models by implementing diffusion sampling loops and using them for other tasks such as inpainting and creating optical illusions. This was done using stages 1 and 2 of the DeepFloyd IF diffusion model. A seed was set before each time results were obtained, and this part was done in Colab. The UNet instance from stage 1 was used for most sections, although section 0 looked at examples with different numbers of inference steps.
This section mainly looked at results from the DeepFloyd IF diffusion model as a point of comparison, an example of what is partially being implemented throughout the rest of this project. Results at 1, 5, 10, 20, 40, and 100 inference steps (in stage 1) are shown below. The seed used to ensure reproducibility was 714. Results with different numbers of steps in stage 2 were also examined (not shown), but since stage 2 only upscales the stage 1 output, it mostly affected image clarity and quality. Increasing the number of inference steps made the images more plausible overall, although the improvements started diminishing at around 10 steps. It was interesting to see how later steps moved in the direction of art rather than attempting to be realistic.
1 Inference Step
5 Inference Steps
10 Inference Steps
20 Inference Steps
40 Inference Steps
100 Inference Steps
In this section, I implemented a forward process function to add noise at different timesteps, based on the alpha values at those timesteps selected by the DeepFloyd IF model. The equation used to calculate x_t, the image at timestep t, is shown below, alongside examples of an image noised at timesteps 250, 500, and 750 and the original image without noise. A short code sketch of this noising step follows the images.
Forward Process Definition and Equivalent Equation
64x64 Image of the Campanile
Noise at Step 250
Noise at Step 500
Noise at Step 750
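A minimal sketch of what this forward function might look like (the names `forward` and `alphas_cumprod` are illustrative, not the exact DeepFloyd API; `alphas_cumprod` is assumed to be a tensor of the cumulative products ᾱ_t):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Forward (noising) process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.

    alphas_cumprod holds the cumulative products abar_t used by the model.
    """
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)                                   # eps ~ N(0, I)
    x_t = torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps
    return x_t, eps
```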
Once the forward-process results at the three timesteps were obtained, I applied Gaussian blur filtering to them to see what classical denoising would look like, shown below.
Blur at Step 250
Blur at Step 500
Blur at Step 750
A pretrained UNet was also used for denoising, to see what the results would look like if the noise was removed in a single step. A sketch of the one-step estimate follows the images below.
UNet One Step at Step 250
UNet One Step at Step 500
UNet One Step at Step 750
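Given the UNet's noise estimate, the clean image can be approximated in one step by inverting the forward equation. A sketch under the same assumptions as above, where `eps_hat` is the noise predicted by the pretrained UNet at timestep t:

```python
import torch

def one_step_denoise(x_t, eps_hat, t, alphas_cumprod):
    """Estimate the clean image x_0 from the noisy x_t in a single step,
    given the UNet's noise estimate eps_hat at timestep t."""
    abar_t = alphas_cumprod[t]
    return (x_t - torch.sqrt(1 - abar_t) * eps_hat) / torch.sqrt(abar_t)
```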
In this section, I finally used iterative denoising (diffusion), with step skipping to speed it up. Starting at timestep 990, strided steps of size 30 were taken down to timestep 0, so that the original image was recovered after starting from a noisy one. In this specific section, the iteration started from the tenth strided timestep to keep the result close to the original image. The equations used to find the predicted previous image at each step are shown below, followed by a code sketch of a single update step.
Iterative Process Equation
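A sketch of one strided update, with illustrative names; `x0_hat` is the clean-image estimate obtained from the current noise prediction (as in the one-step formula above), and `alphas_cumprod` is the same ᾱ table as before:

```python
import torch

def iterative_denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod, variance=None):
    """One strided denoising update from timestep t down to t_prev (< t)."""
    abar_t = alphas_cumprod[t]
    abar_prev = alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev                       # effective alpha over the stride
    beta_t = 1 - alpha_t

    # Blend the clean-image estimate x0_hat with the current noisy image x_t.
    x_prev = (torch.sqrt(abar_prev) * beta_t / (1 - abar_t)) * x0_hat \
           + (torch.sqrt(alpha_t) * (1 - abar_prev) / (1 - abar_t)) * x_t
    if variance is not None:                           # optional added noise term v_sigma
        x_prev = x_prev + variance
    return x_prev
```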
This process was run on a noisy Campanile image. The resulting images at every fifth strided timestep are shown below, followed by the noisy input, the iterative result, the one-step UNet result, and the Gaussian blur result.
Iterative Prediction at Step 90
Iterative Prediction at Step 240
Iterative Prediction at Step 390
Iterative Prediction at Step 540
Iterative Prediction at Step 690
Noisy Image
Iterative Result
One Step Result
Gaussian Blur Result
Instead of starting from the tenth strided timestep, this time the iterative process was started from the very first one with a completely random (pure noise) image. This allowed for the generation of entirely new images, with the prompt "a high quality photo" used as the model input at each step.
Random Image 1
Random Image 2
Random Image 3
Random Image 4
Random Image 5
To improve the images, classifier-free guidance (CFG) was added to the iterative denoiser by computing the estimated noise both conditioned on a text prompt and unconditionally. The overall noise estimate is the unconditional estimate plus a scale factor times the difference between the prompt-conditioned estimate and the unconditional estimate. Five examples of the results are shown below; a scale factor of 7 was used with the prompt "a high quality photo" each time. A sketch of the CFG combination follows the images.
CFG Random Image 1
CFG Random Image 2
CFG Random Image 3
CFG Random Image 4
CFG Random Image 5
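The CFG combination itself is essentially a one-liner. A sketch, assuming `eps_cond` and `eps_uncond` are the UNet's noise estimates with and without the text prompt:

```python
def cfg_noise(eps_cond, eps_uncond, scale=7.0):
    """Classifier-free guidance: push the unconditional estimate toward the
    prompt-conditioned one by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```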
Noise was added to images in varying amounts, up to certain timesteps, before running the CFG iterative denoising function on them to see what results denoising from those points would produce. This was done on three images: the Campanile, a picture of a mug with Cal on it, and a picture of a plushy. These results are shown below.
Campanile 1
Campanile 3
Campanile 5
Campanile 7
Campanile 10
Campanile 20
Campanile Original
Cal Mug 1
Cal Mug 3
Cal Mug 5
Cal Mug 7
Cal Mug 10
Cal Mug 20
Cal Mug Original
Plushy 1
Plushy 3
Plushy 5
Plushy 7
Plushy 10
Plushy 20
Plushy Original
The same process as the previous part was done here, except this time on one image obtained from the web and two very simple, roughly drawn and rather vague images. The first drawing was loosely meant to be a sailboat-like metal battleship and the second a starry sky. The web image is from here.
Web Image
Drawn Image 1
Drawn Image 2
Web Image Results
Drawn Image 1 Results
Drawn Image 2 Results
Inpainting can be done by forcing the image at each step to match the forward (noised) prediction from the original image everywhere outside the mask, so that only the masked portion of the image develops a new result while the rest stays the original image. This is shown below on the Campanile, mug, and plushy: the Campanile ends up with a new top, the mug ends up phased out of the image (a bit badly), and the plushy gets a nose. A sketch of the masking step follows the results.
Inpainting Process
Inpainted Campanile
Inpainted Mug
Inpainted Plushy
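In code, the mask constraint is applied right after each denoising step. A sketch, assuming `mask` is 1 inside the region being regenerated and `forward` is the noising sketch from earlier:

```python
def apply_inpaint_mask(x_t, x_orig, mask, t, alphas_cumprod):
    """Pin everything outside the mask to a freshly noised copy of the
    original image; only the masked region is free to change."""
    x_forced, _ = forward(x_orig, t, alphas_cumprod)   # forward-process sketch from earlier
    return mask * x_t + (1 - mask) * x_forced
```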
This runs the same process as the first part of section 7, but this time the denoising is guided by a text prompt. This allows for interesting changes, as the model attempts to push the original image toward the prompt to varying extents, shown below for different amounts of added noise on the Campanile, mug, and plushy.
Prompt-Based Translation from the Campanile, Starting from Forward Noising to Different Timesteps
Prompt-Based Translation from the Mug, Starting from Forward Noising to Different Timesteps
Prompt-Based Translation from the Plushy, Starting from Forward Noising to Different Timesteps
In this section, visual anagrams from two different prompts were created. The noise estimate from running the model on the image with one prompt and the noise estimate from running it on the flipped image with another prompt were averaged into a combined estimate at each step, until the final result appears to be one image right side up and another image upside down. This is shown below, followed by a sketch of the noise combination.
"an oil painting of people around a campfire" and "an oil painting of an old man"
"an oil painting of people around a campfire" and "an oil painting of an old man"
"an oil painting of a snowy mountain village" and "a photo of the amalfi cost"
"a rocket ship" and "a lithograph of waterfalls"
In this section, the low frequencies of the noise estimate from running the model on the current step with one prompt and the high frequencies of the estimate from running it with another prompt were combined into an overall noise estimate to remove, which eventually produces a hybrid image after all the iterations. A kernel size of 33 and a sigma of 2 were used. The results are shown below, followed by a sketch of the frequency split.
"a lithograph of a skull" and "a lithograph of waterfalls"
"a photo of the amalfi cost" and "a puffy cloud
"a space battleship" and "a chalk drawing of a skyscraper"
After having had fun working with prebuilt models in Part A, in this part I implemented some basic versions of the model from the previous section. Three versions of UNet were made: single-step, time-conditioned, and time- and class-conditioned. Each was an improvement on the previous one, built with PyTorch to first denoise and then generate results based on the MNIST dataset. The first step was to make a noise generator function used to train the models to denoise, with results at different σ values shown below and a sketch of the noising function after them. Everything was done in Google Colab.
Noise at Different σ Values
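The noising operation here is simpler than in Part A. A minimal sketch, assuming the images are in [0, 1]:

```python
import torch

def add_noise(x, sigma):
    """z = x + sigma * eps with eps ~ N(0, I); used to make (noisy, clean) training pairs."""
    return x + sigma * torch.randn_like(x)
```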
The purpose of section 1 was to implement a basic version of UNet that could be used to try to denoise images in a single step, the simplest form possible.
To implement UNet myself using the nn.Module class from PyTorch, I first made classes for basic blocks that I then put together into a class for the UNet model. The basic blocks generally consisted of a convolution of some sort followed by a batch norm and a GELU, although several blocks used operations such as average pooling and concatenation instead, as those are also part of UNet. The number of channels in each section was generally some multiple of a hidden-dimension hyperparameter. A sketch of one such block is shown below.
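As an illustration, one of the simpler building blocks might look like this (a sketch of the Conv-BatchNorm-GELU pattern described above, not the exact code):

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution followed by batch norm and GELU, preserving spatial size."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)
```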
After establishing a UNet model, I trained it on the MNIST training set. This was done over five epochs with batches of 256 images at a time. L2 loss, also known as mean squared error, was used to train the model, which was trained on images noised with a σ of 0.5. The Adam optimizer was used with a learning rate of 1e-4, and the hidden dimension for this model was set to 128. Overall, training took around 8 minutes with a final loss of 0.008393 on the last batch of data. After obtaining the trained model, testing was done to see how effective it would be on images with varying levels of added noise. A sketch of the training loop follows.
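A rough sketch of the training loop for this single-step denoiser, under the hyperparameters above (`train_loader` is a standard MNIST DataLoader; names are illustrative):

```python
import torch
import torch.nn.functional as F
import torch.optim as optim

def train_denoiser(model, train_loader, sigma=0.5, epochs=5, lr=1e-4):
    """Train the single-step denoiser: noise each batch with the given sigma
    and regress the UNet output onto the clean image with L2 (MSE) loss."""
    opt = optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(epochs):
        for x, _ in train_loader:                  # MNIST labels are unused here
            z = x + sigma * torch.randn_like(x)    # noisy input
            loss = F.mse_loss(model(z), x)         # compare denoised output to clean image
            opt.zero_grad()
            loss.backward()
            opt.step()
            losses.append(loss.item())
    return losses
```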
The results of the model are shown below. The first image is the plot of the loss at each training step, with the loss curves for the first and fifth epochs also shown for comparison. After that, the samples used for testing, which show the difference in effectiveness between training for one epoch and five epochs, are shown, followed by the results at those two epochs. Finally, the out-of-distribution results, where the model was tested on images noised with σ values other than 0.5, are shown beneath, with artifacts showing up significantly starting around σ = 0.8.
Plot of Loss per Step for UNet
Plot of Loss per Step for Epoch 1
Plot of Loss per Step for Epoch 5
Samples for Testing
UNet Results at Epoch 1
UNet Results at Epoch 5
Out-Of-Distribution Results 1
Out-Of-Distribution Results 2
Out-Of-Distribution Results 3
In this part, time conditioning and then class conditioning are added to the UNet model put together previously, turning it from just a denoiser into a diffusion model and giving it the ability to iteratively denoise an image so that it can even generate new images from noise. This is done using a combination of the equations used to build the iterative denoiser in Part A and an L2 loss on the predicted noise, rather than on the resulting image, produced by the UNet from the first section, with additional fully connected blocks at specific points so the model can take a timestep (and later a class) as input. The β values for the chosen 300 timesteps range from 0.0001 to 0.02, evenly spaced, and the ᾱ values are the cumulative product of 1 − β up to each timestep.
To add time conditioning to UNet, fully connected blocks, which are just a linear layer followed by a GELU, run on the timestep t, were added to the unflatten result and the first up-block result. This allows the model to account and adjust for the timestep when deciding how much and what sort of noise is present. The fully connected blocks are run on a normalized t value to maximize effectiveness; a sketch is shown below.
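A sketch of the fully connected block and its injection points (names like `fc_t1` and the exact injection sites are illustrative of the description above, not the exact code):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """A linear layer followed by GELU, run on the normalized timestep."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.act = nn.GELU()

    def forward(self, t):
        return self.act(self.fc(t))

# Injection points inside the UNet forward pass (sketch): the block outputs are
# broadcast over the spatial dimensions and added to the feature maps.
#   unflat = unflat + self.fc_t1(t_norm)[:, :, None, None]
#   up1    = up1    + self.fc_t2(t_norm)[:, :, None, None]
```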
To train the UNet, a training algorithm is run using the equations from Part A to calculate the noised images at random timesteps. A batch size of 128 and 20 epochs were used to maximize the effectiveness of the new UNet, and a hidden dimension of 64 was selected. The Adam optimizer was used with an initial learning rate of 1e-3, together with an exponential learning-rate scheduler with decay 0.1^(1/num_epochs). The loss-versus-step curve is shown below, with almost all of the improvement at the beginning and everything after that being quite gradual; a sketch of the noise schedule and one training step follows the plot.
Plot of Loss per Step for Time Conditioning
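A sketch of the schedule and a single training step under the setup described above (the model call signature and helper names are assumptions, not the exact code):

```python
import torch
import torch.nn.functional as F

# Noise schedule: betas evenly spaced from 1e-4 to 0.02 over 300 timesteps,
# alpha_bar_t = cumulative product of (1 - beta) up to t.
T = 300
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def train_step(model, x0, optimizer):
    """One training step: noise a clean batch to random timesteps and regress
    the UNet output onto the injected noise (L2 loss on predicted noise)."""
    t = torch.randint(0, T, (x0.shape[0],))
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(abar) * x0 + torch.sqrt(1 - abar) * eps
    eps_hat = model(x_t, t.float().unsqueeze(1) / T)   # normalized timestep as input
    loss = F.mse_loss(eps_hat, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```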
Another algorithm was needed to properly sample from the model, based on the iterative denoiser of Part A. The resulting model allowed both the cleaning up of noisy images, including from a separate test set, and the generation of new ones close to the original MNIST digits. Examples of generating new images at epoch 5 and epoch 20 are shown below, followed by a sketch of the sampling loop. The epoch 20 results are slightly improved over the epoch 5 results, which makes sense as improvements became much more incremental after the first couple of epochs.
Time-Conditioned UNet at Epoch 5
Time-Conditioned UNet at Epoch 20
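A sketch of the sampling loop, reusing the schedule variables from the training sketch above and the standard DDPM reverse step (the model signature and image shape are assumptions):

```python
import torch

@torch.no_grad()
def sample(model, n_samples, img_shape=(1, 28, 28)):
    """Start from pure noise and iteratively denoise down to t = 0."""
    x = torch.randn(n_samples, *img_shape)
    for t in range(T - 1, -1, -1):
        t_norm = torch.full((n_samples, 1), t / T)
        eps_hat = model(x, t_norm)
        alpha, beta, abar = alphas[t], betas[t], alphas_cumprod[t]
        # DDPM reverse-step mean.
        x = (x - (beta / torch.sqrt(1 - abar)) * eps_hat) / torch.sqrt(alpha)
        if t > 0:
            x = x + torch.sqrt(beta) * torch.randn_like(x)   # add noise except at the final step
    return x
```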
To complete the basic diffusion model, I added class conditioning by letting the model take a class as input, one-hot encoding it, and passing it through fully connected blocks whose outputs are multiplied with the unflatten and up-block results before the time embeddings are added to them. During training, the model is also set to ignore the class a tenth of the time. The conditioning is set up for ten classes, the digits 0 through 9, since this is the MNIST set. The training algorithm is only slightly modified from the time-conditioned model to take in the class alongside the image and timestep. The loss of this model is shown below, followed by a sketch of the class-conditioning step.
Plot of Loss per Step for Class Conditioning
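A sketch of the class-conditioning addition, assuming `c` is a tensor of integer digit labels; the exact place where the class vector is dropped is an assumption, but it matches the idea of ignoring the class a tenth of the time:

```python
import torch
import torch.nn.functional as F

def make_class_input(c, num_classes=10, p_uncond=0.1, training=True):
    """One-hot encode the class labels, zeroing the vector ~10% of the time
    during training so the model also learns an unconditional estimate."""
    c_onehot = F.one_hot(c, num_classes).float()
    if training:
        keep = (torch.rand(c.shape[0], 1) > p_uncond).float()
        c_onehot = c_onehot * keep
    return c_onehot

# Inside the UNet forward pass (sketch): the class FCBlock output multiplies the
# feature map and the time FCBlock output is then added to it.
#   unflat = fc_c1(c_onehot)[:, :, None, None] * unflat + fc_t1(t_norm)[:, :, None, None]
#   up1    = fc_c2(c_onehot)[:, :, None, None] * up1    + fc_t2(t_norm)[:, :, None, None]
```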
One notable thing about class conditioning is that on its own it is not very strong (as seen in Part A). As such, unconditional noise predictions are also combined in during sampling, via classifier-free guidance, to ensure good results. The γ value for this was set to 5.0. The results of adding class-conditioned guidance are shown below for epoch 5 and epoch 20, followed by a sketch of the guided noise estimate. The differences are less stark than in previous examples, but the epoch 5 results are generally a bit thicker and one of them has a leftover artifact standing out on its own.
Class-Conditioned UNet at Epoch 5
Class-Conditioned UNet at Epoch 20
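A sketch of the guided noise estimate used during class-conditioned sampling (the model call signature is an assumption; a zeroed class vector stands in for the unconditional case, matching the dropout used in training):

```python
import torch

@torch.no_grad()
def cfg_eps(model, x, t_norm, c_onehot, gamma=5.0):
    """Guided noise estimate for the class-conditioned UNet: combine conditional
    and unconditional predictions with guidance scale gamma."""
    eps_cond = model(x, t_norm, c_onehot)
    eps_uncond = model(x, t_norm, torch.zeros_like(c_onehot))   # zeroed class vector = unconditional
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```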