Project 5: Advanced Image Processing and Analysis
In this project, we explore diffusion models for image generation and editing: first experimenting with the pretrained DeepFloyd model, then building and training our own denoising UNets.
1.0 Setup
We used the pretrained DeepFloyd diffusion model to run some tests. The interesting thing here is that we can control the level of detail in the generated pictures with the number of inference steps. Here are three prompts, each at two settings. The random seed I am using is 4.
man with a hat (inference 20)
rocket (inference 20)
oil painting of a snowy village (inference 20)
man with a hat (inference 40)
rocket (inference 40)
oil painting of a snowy village (inference 40)
As we can see, the more inference steps, the more detail we get. This is best seen in the man with the hat, where the shirt gains more wrinkles and fine detail.
1.1 Implementing the Forward Process
The first step is to add noise to our image. We do this with the forward-process equation x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε, where ε is sampled from a Gaussian and ᾱ_t is the cumulative product of the noise schedule. Here is the image sampled at 3 different t's: 250, 500, and 750.
Berkeley Campanile
t = 250
t = 500
t = 750
As we can see, the higher the t, the noisier the image.
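A minimal sketch of this forward process in PyTorch, assuming a precomputed alphas_cumprod tensor holding ᾱ_t for every timestep (the names here are illustrative, not DeepFloyd's actual API):

```python
import torch

def forward_process(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # Gaussian noise, same shape as the image
    x_t = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    return x_t, eps
```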
1.2 Classical Denoising
Now, we'll try classical denoising methods, namely Gaussian blur filtering, using the standard torchvision Gaussian blur.
t = 250
t = 500
t = 750
t = 250, Gaussian blur denoised
t = 500, Gaussian blur denoised
t = 750, Gaussian blur denoised
As we can see, the results are not great: the blur suppresses some noise, but it smears away image detail along with it.
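The blur itself is a single torchvision call; the kernel size and sigma below are illustrative values, not the exact ones we used:

```python
from torchvision.transforms.functional import gaussian_blur

# Classical "denoising": low-pass filter the noisy image.
denoised = gaussian_blur(x_t, kernel_size=5, sigma=2.0)
```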
1.3 One-Step Denoising
Now, we'll use one-step denoising to do better. We use the pretrained UNet to predict the noise ε, then solve our earlier equation for an estimate of the clean image x_0.
t = 250
t = 500
t = 750
t = 250, One step denoised
t = 500, One step denoised
t = 750, One step denoised
As we can see, this is a much better result than with the classical method.
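A sketch of that estimate, reusing forward_process's notation; the unet(x_t, t) interface is an assumption (the real DeepFloyd UNet also takes prompt embeddings):

```python
def one_step_denoise(unet, x_t, t, alphas_cumprod):
    """Invert the forward equation, substituting the UNet's noise
    prediction eps_hat for the true noise."""
    abar_t = alphas_cumprod[t]
    eps_hat = unet(x_t, t)  # predicted noise (assumed interface)
    x0_hat = (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
    return x0_hat
```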
1.4 Iterative Denoising
Now, we'll try iterative denoising. Instead of guessing all the noise in one step, we repeatedly step from timestep t to an earlier timestep t', using the update x_{t'} = (sqrt(ᾱ_{t'}) β_t / (1 − ᾱ_t)) x_0 + (sqrt(α_t) (1 − ᾱ_{t'}) / (1 − ᾱ_t)) x_t + v_σ, where x_0 is our current clean-image estimate, α_t = ᾱ_t / ᾱ_{t'}, β_t = 1 − α_t, and v_σ is random noise that the model predicts.
t = 90
t = 240
t = 390
t = 540
t = 690
Berkeley Campanile
Iteratively denoised
One step denoised
Gaussian blur denoised
Here, we show the image at several steps of the loop, followed by a comparison of the three methods. As the final results show, the iteratively denoised image is the best and the closest to the original.
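A sketch of one update from the formula above, assuming the same illustrative unet interface and schedule as before (the v_σ variance noise is left out for brevity):

```python
def iterative_denoise_step(unet, x_t, t, t_prev, alphas_cumprod):
    """One strided denoising step from timestep t down to t_prev."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev  # effective alpha for this stride
    beta_t = 1 - alpha_t
    eps_hat = unet(x_t, t)  # predicted noise (assumed interface)
    x0_hat = (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
    # Blend the clean estimate with the current noisy image; a v_sigma term
    # would be added here if the model also predicts variance.
    return ((abar_prev.sqrt() * beta_t / (1 - abar_t)) * x0_hat
            + (alpha_t.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x_t)
```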
1.5 Diffusion Model Sampling
Now, we'll input pure noise into the model, and see what the model generates from that.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
These are some pretty cool things that the model generated from pure noise!
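Sampling is just the loop above started from pure noise at the largest timestep; timesteps here is assumed to be the strided schedule from 1.4, ordered from noisiest to cleanest:

```python
import torch

x = torch.randn(1, 3, 64, 64)  # pure Gaussian noise at t = T
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    x = iterative_denoise_step(unet, x, t, t_prev, alphas_cumprod)
# x is now a generated image
```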
1.6 Classifier-Free Guidance
The generated pictures can be improved using classifier-free guidance (CFG): we compute both a conditional and an unconditional noise estimate and combine the two, which gives noticeably better pictures. We used a guidance scale γ of 7.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
As we can see, the pictures generated are definitely better than what we had earlier.
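The combination is a simple extrapolation away from the unconditional estimate; the three-argument unet call is again an assumed interface:

```python
def cfg_noise_estimate(unet, x_t, t, cond, uncond, gamma=7.0):
    """Classifier-free guidance: eps = eps_u + gamma * (eps_c - eps_u)."""
    eps_c = unet(x_t, t, cond)    # conditional (prompt) noise estimate
    eps_u = unet(x_t, t, uncond)  # unconditional (null prompt) noise estimate
    return eps_u + gamma * (eps_c - eps_u)
```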
1.7 Image to Image Translation
Now, we use the iterative denoising function with CFG to generate images close to a source image: we take the original image, inject a little Gaussian noise, and start denoising at different points in the schedule. As we can see, the later we start (larger i_start, less injected noise), the closer the result stays to the original image.
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
original photo
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
original photo
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
original photo
The last two images are ones I chose on my own: the Empire State Building and the Great Wall of China. I wanted to see how well the model could capture their details.
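A sketch of the whole procedure (the SDEdit idea), built from the pieces above; the function names and schedule are the same illustrative assumptions:

```python
def sdedit(unet, x_orig, i_start, timesteps, alphas_cumprod):
    """Noise the source image to timesteps[i_start], then iteratively
    denoise from there. Larger i_start = less noise = closer to the source."""
    x, _ = forward_process(x_orig, timesteps[i_start], alphas_cumprod)
    for t, t_prev in zip(timesteps[i_start:-1], timesteps[i_start + 1:]):
        x = iterative_denoise_step(unet, x, t, t_prev, alphas_cumprod)
    return x
```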
1.7.1 Editing Hand Drawn and Web Images
Now we try nonrealistic images. I used a few different ones: a web image of an anime character, as well as two hand-drawn images.
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
original photo
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
original photo
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
original photo
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
original photo
One interesting thing to note is that the model did not handle the text within my image well.
1.7.2 Inpainting
The next test is to see whether the model can paint in parts of an image that we remove. At each denoising step, we force every pixel outside the mask back to the original image (noised to the matching level), so the model only generates new content inside the mask. For example, if we mask out a block of the Campanile photo, what will the model produce?
original
mask
hole to fill
result
original
mask
hole to fill
result
original
mask
hole to fill
result
These are some pretty interesting results!
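A sketch of the per-step forcing described above, reusing forward_process; mask is 1 where the model should generate and 0 where the original should be kept:

```python
def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Keep generated pixels inside the mask; outside it, reset to the
    original image noised to the current timestep."""
    x_orig_t, _ = forward_process(x_orig, t, alphas_cumprod)
    return mask * x_t + (1 - mask) * x_orig_t
```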
1.7.3 Text-Conditional image to image Translation
Now, we'll give the model a text prompt while starting from a noised version of our original image, to see how the denoising pulls the image toward the prompt.
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
original photo
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
original photo
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
original photo
These are some very interesting pictures!
1.8 Visual Anagrams
Now, we'll make visual anagrams: upright, the picture looks like one thing; flipped, it looks like something else. We did this by combining two noise estimates. The first is the estimate for the first prompt on the image as-is. For the second, we flip the image, generate the noise estimate for the second prompt, then flip that estimate back. We average the two noises and use the result to step to the next timestep.
People around a campfire
Oil painting of a man
Photo of a man
Photo of a dog
Coast of Amalfi
Hipster person
These are some really cool applications of the techniques we built up. The only one that didn't work out super well was the dog/human pair, but I feel that's because those two are hard to merge.
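A sketch of the averaged estimate; torch.flip on the height axis gives the upside-down view, and the prompt-taking unet call is the same assumed interface:

```python
import torch

def anagram_noise(unet, x_t, t, prompt_a, prompt_b):
    """Average prompt A's estimate with prompt B's estimate computed on the
    flipped image and flipped back, so both orientations get denoised."""
    eps_a = unet(x_t, t, prompt_a)
    eps_b = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, prompt_b), dims=[-2])
    return (eps_a + eps_b) / 2
```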
1.9 Hybrid Images
Finally, we'll implement hybrid images by creating another composite noise estimate: we combine the low frequencies of one prompt's noise estimate with the high frequencies of another's.
Skull/Waterfall
Oil painting of snowy mountains/waterfalls
Rocket/Pencil
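A sketch of the frequency split, using a Gaussian blur as the low-pass filter (high-pass = estimate minus its own blur); the kernel size and sigma are illustrative:

```python
from torchvision.transforms.functional import gaussian_blur

def hybrid_noise(unet, x_t, t, prompt_lo, prompt_hi, kernel_size=33, sigma=2.0):
    """Low frequencies from one prompt's noise estimate, high from the other's."""
    eps_lo = gaussian_blur(unet(x_t, t, prompt_lo), kernel_size, sigma)
    eps_hi = unet(x_t, t, prompt_hi)
    eps_hi = eps_hi - gaussian_blur(eps_hi, kernel_size, sigma)
    return eps_lo + eps_hi
```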
2.1 Training a Single-Step Denoising UNet
Now, we'll actually build our own UNet denoiser and train it with an L2 loss. Training is similar to what we did earlier: we add Gaussian noise to clean pictures and teach the network to remove it, optimizing with Adam. Here's how the results look after a few epochs of training.
Results after epoch 1
Results after epoch 5
Below is a graph of the loss.
Loss
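A minimal sketch of the training loop under these assumptions (a unet model mapping noisy images to clean ones, and a standard image DataLoader); the hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def train_denoiser(unet, loader, sigma=0.5, epochs=5, lr=1e-4, device="cuda"):
    """Single-step denoiser: corrupt x with fixed-sigma Gaussian noise and
    regress back to x under an L2 (MSE) loss, optimized with Adam."""
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:  # labels unused for plain denoising
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)  # noisy input
            loss = F.mse_loss(unet(z), x)        # L2 loss against the clean image
            opt.zero_grad()
            loss.backward()
            opt.step()
```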
2.1.2 Out of distribution testing
Now, we'll see what the denoised output is for varying levels of noise.
sigma = 0
sigma = 0.2
sigma = 0.4
sigma = 0.5
sigma = 0.6
sigma = 0.8
sigma = 1.0
2.2 Adding Time Conditioning to the UNet
Now, we'll inject a scalar t into our UNet to condition it on time. We do this by embedding t and adding it to the feature maps between our up-blocks. Here is the resulting loss curve for the time-conditioned UNet.
Loss
Here are the results of the sampling from epoch 5 and epoch 20.
Results after epoch 5
Results after epoch 20
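A sketch of the conditioning block; the exact wiring (where the output is added, how t is normalized) is our assumption of the general pattern:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Map the normalized scalar timestep to a per-channel vector that is
    added to a decoder feature map (broadcast over height and width)."""
    def __init__(self, out_channels, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.GELU(), nn.Linear(hidden, out_channels)
        )

    def forward(self, t):  # t: (batch, 1), already scaled to [0, 1]
        return self.net(t)[..., None, None]

# Inside the UNet decoder, between up-blocks:
#   feat = upblock(feat) + fc_block(t / T)
```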
2.4 Adding Class-Conditioning
Now, we'll do something similar, but condition on the class label as well as t. Here is the resulting loss curve.
Loss
Here are the results of the sampling from epoch 5 and epoch 20.
Results after epoch 5
Results after epoch 20
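For class conditioning, one common recipe (and our assumption of the details here, including the class count) is to one-hot encode the label and randomly zero it out during training, so the model also learns the unconditional case needed for classifier-free guidance at sampling time:

```python
import torch
import torch.nn.functional as F

def class_vector(c, num_classes=10, p_uncond=0.1):
    """One-hot class vector, dropped to all-zeros with probability p_uncond."""
    onehot = F.one_hot(c, num_classes).float()            # (batch, num_classes)
    keep = (torch.rand(c.shape[0], 1) >= p_uncond).float()
    return onehot * keep                                  # zeroed rows = unconditional
```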