Cartoon images:

- Download `all_data.csv` from the safebooru dataset here
- Set the path to the stored `all_data.csv` in `PATH_TO_SAFEBOORU_ALL_DATA_CSV` of `cartoon_image_downloader.py`
- Configure `PATH_TO_STORE_DOWNLOADED_CARTOON_IMAGES` and `CARTOON_IMAGES_ZIPFILE_NAME`
- Run `make install` to install the necessary libraries
- Run `make cartoons` to download a configurable amount of medium-size images

Smoothed cartoon images:

- Configure `PATH_TO_STORED_CARTOON_IMAGES` in `cartoon_image_smoothing.py`
- Configure `PATH_TO_STORE_SMOOTHED_IMAGES` and `SMOOTHED_IMAGES_ZIPFILE_NAME`
- Run `make cartoons-smooth` to create the smoothed images

Photos:

- Configure `PATH_TO_COCO_ANNOTATIONS_ROOT_FOLDER` of `photo_downloader.py`; e.g. if the annotations are stored in `/tmp/annotations/`, then configure `PATH_TO_COCO_ANNOTATIONS_ROOT_FOLDER` to `/tmp`
- Configure `PATH_TO_STORE_DOWNLOADED_PHOTOS` and `PHOTO_ZIPFILE_NAME`
- Run `make photos` to download a configurable amount of photos of persons

The `ToTensor()` method changes the range of the input image from RGB [0, 255] to [0.0, 1.0], so we get the same value range for all images.
```python
import math
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import ImageFolder

image_size = 256
batch_size = 16

# CenterCrop unifies the image dimensions, ToTensor maps RGB [0, 255] to [0.0, 1.0]
transformer = transforms.Compose([
    transforms.CenterCrop(image_size),
    transforms.ToTensor()])

cartoon_dataset = ImageFolder('cartoons/', transformer)

# 90/10 split into training and validation set
len_training_set = math.floor(len(cartoon_dataset) * 0.9)
len_valid_set = len(cartoon_dataset) - len_training_set
training_set, _ = random_split(cartoon_dataset, (len_training_set, len_valid_set))
cartoon_image_dataloader_train = DataLoader(training_set, batch_size, shuffle=True, num_workers=0)
```
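As a quick sanity check (a small sketch of my own, assuming the loader defined above), the first batch should have the expected shape and value range:

```python
# Fetch one batch from the loader defined above (hypothetical check, not project code)
images, labels = next(iter(cartoon_image_dataloader_train))
print(images.shape)                              # torch.Size([16, 3, 256, 256])
print(images.min().item(), images.max().item())  # within [0.0, 1.0] thanks to ToTensor()
```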
To calculate the zero padding of the convolutional layers, I use the following formula:

$$Height \times Width_{output} = \frac{Height \times Width_{input} - kernel\ size + 2 \cdot padding}{stride} + 1$$
E.g. the `conv_1` layer of the generator: $H \times W$ should stay the same as the input size, which is $256 \times 256$, with stride = 1:

$$256 = \frac{256 - 7 + 2 \cdot padding}{1} + 1,\quad padding = 3$$
If the result is a fraction, I choose to round up (ceil):
The `conv_2` layer of the generator: the output is $\frac{H}{2} \times \frac{W}{2}$ with stride = 2:

$$128 = \frac{256 - 3 + 2 \cdot padding}{2} + 1,\quad padding = \frac{1}{2} \Rightarrow padding = 1$$
As `Conv2d()` does not allow a floating-point stride, I use a stride of 1 in both cases. Therefore I also go with a stride of 1 for the padding calculation.
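The same calculation can be scripted. As a small sketch (the helper name is mine, not part of the project), this reproduces the padding choices above:

```python
import math

def conv_output_size(input_size, kernel_size, stride, padding):
    # Output height/width of a Conv2d layer for a square input (PyTorch floors the division)
    return math.floor((input_size - kernel_size + 2 * padding) / stride) + 1

print(conv_output_size(256, kernel_size=7, stride=1, padding=3))  # conv_1: 256 -> 256
print(conv_output_size(256, kernel_size=3, stride=2, padding=1))  # conv_2: 256 -> 128
```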
I re-checked my implementation of the generator and stumbled across my interpretation of the stride for `conv_6` and `conv_8`, which is $\frac{1}{2}$ in the paper. Maybe I got the part about the stride wrong, and it is not $\frac{1}{2}$, but a tuple of $(1,2)$? If so, my padding calculation was wrong. These are the only layers where I calculated a very large padding of 33 and 65, which looks suspicious now.
Testing a tuple of $(1,2)$, I also ended up with very high values for the padding.
The next problem was that I used `Conv2d` for up-sampling, but `Conv2d` is for down-sampling. `ConvTranspose2d` is for up-sampling; see [6] and [7]. I corrected my implementation accordingly.
By using `ConvTranspose2d` with the given values for stride (1 or $(1,2)$) and kernel size, and playing with the padding, the resulting image either keeps nearly the same dimensions, shrinks, or gets uneven dimensions.

$$Height \times Width_{output} = stride \cdot (Height \times Width_{input} - 1) + kernel\ size - 2 \cdot padding$$

$$Height \times Width_{output} = 1 \cdot (64 - 1) + 3 - 2 \cdot padding = 66 - 2 \cdot padding = \begin{cases} 66, & padding = 0 \\ < 66, & padding > 0 \end{cases}$$

Uneven dimensions: I tested stride $(1,2)$ with padding $(3,2)$ and got $60 \times 125$ as the image size.
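Evaluating the formula per dimension (again a small sketch with a helper name of my own) reproduces these numbers:

```python
def conv_transpose_output_size(input_size, kernel_size, stride, padding, output_padding=0):
    # Output height/width of a ConvTranspose2d layer for one dimension
    return stride * (input_size - 1) + kernel_size - 2 * padding + output_padding

print(conv_transpose_output_size(64, kernel_size=3, stride=1, padding=0))  # 66 (stays nearly the same)
print(conv_transpose_output_size(64, kernel_size=3, stride=1, padding=3))  # 60 (height)
print(conv_transpose_output_size(64, kernel_size=3, stride=2, padding=2))  # 125 (width)
```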
But as mentioned in the paper, I need to scale from $\frac{H}{4}$ up to $\frac{H}{2}$, which is from 64 to 128, and then up to 256 in `conv_8` and `conv_9`. Therefore I decided to use stride = 2 and padding = 1 in `conv_6`, and stride = 2 and padding = 1 in `conv_8`. To add the last pixel, I use an `output_padding` of 1.

`conv_6`: $2 \cdot (64 - 1) + 3 - (2 \cdot 1) = 127 + 1$ (`output_padding`) $= 128$

`conv_8`: $2 \cdot (128 - 1) + 3 - (2 \cdot 1) = 255 + 1$ (`output_padding`) $= 256$
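To double-check these dimensions directly in PyTorch (a minimal sketch; the channel counts are placeholders, not the ones from the paper):

```python
import torch
import torch.nn as nn

# conv_6-like up-sampling: H/4 -> H/2 (64 -> 128), conv_8-like: H/2 -> H (128 -> 256)
conv_6 = nn.ConvTranspose2d(128, 128, kernel_size=3, stride=2, padding=1, output_padding=1)
conv_8 = nn.ConvTranspose2d(128, 64, kernel_size=3, stride=2, padding=1, output_padding=1)

x = torch.randn(1, 128, 64, 64)     # dummy feature map at 64 x 64
print(conv_6(x).shape)              # torch.Size([1, 128, 128, 128])
print(conv_8(conv_6(x)).shape)      # torch.Size([1, 64, 256, 256])
```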
For the discriminator, this formula is already the complete loss function, because the output of the discriminator plays no role within the content loss part of the overall loss function.

For the initialization phase of the generator, this part of the formula is not used, as described in the paper.

For the training phase of the generator, only the part of the formula which the generator can affect is used within the generator loss function: $$\mathbb{E}_{p_k \sim S_{data}(p)}[\log(1 - D(G(p_k)))]$$
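As a sketch of this term (function and variable names are mine; it assumes `D` outputs values in $[0, 1]$), this is the only adversarial part that ends up in the generator loss:

```python
import torch

def generator_adversarial_term(D, G, photo_batch, eps=1e-8):
    # E_p[log(1 - D(G(p)))] -- the only adversarial term the generator can influence
    fake_prob = D(G(photo_batch))
    return torch.log(1.0 - fake_prob + eps).mean()
```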
For the content loss, the feature maps of the VGG layer `conv4_4` are used. The output of layer $l$ for the original photo is subtracted from the output of layer $l$ for the generated image. The result is regularized using the $\mathcal{L}_1$ sparse regularization ($||\dots||_1$):

$$\mathcal{L}_{con}(G, D) = \mathbb{E}_{p_i \sim S_{data}(p)}[\,||VGG_l(G(p_i)) - VGG_l(p_i)||_1\,]$$
This part of the formula only plays a role in the loss function of the generator, not of the discriminator, because only the generator appears in it. More about $\mathcal{L}_1$ regularization can be found in [10] and [11].
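A minimal sketch of this content loss (the slicing index and helper names are my assumptions; in torchvision's VGG19 the `conv4_4` output should correspond to `features[:26]`, which is worth verifying against the layer list):

```python
import torch.nn as nn
from torchvision.models import vgg19

# VGG19 feature extractor up to (and including) conv4_4, frozen for loss computation
vgg_l = vgg19(pretrained=True).features[:26].eval()
for param in vgg_l.parameters():
    param.requires_grad = False

l1 = nn.L1Loss()

def content_loss(G, photo_batch):
    # L_con(G, D) = E_p[ || VGG_l(G(p)) - VGG_l(p) ||_1 ], here as mean absolute difference
    return l1(vgg_l(G(photo_batch)), vgg_l(photo_batch))
```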
In this section I will describe what I learned while implementing and testing the loss functions.
I used the discriminator output of shape `batch_size x 1 x 64 x 64` as input to the discriminator loss (to be precise, all three outputs of `D`: `D(cartoon_image)`, `D(smoothed_cartoon_image)` and `D(G(photo))`).
The adversarial loss outputs a probability which indicates whether the input is detected as fake or not, so it returns a single value. To achieve this, I took the input tensor of shape `batch_size x 1 x 64 x 64` and implemented the loss function

$$\mathcal{L}_{adv}(G, D) = \mathbb{E}_{c_i \sim S_{data}(c)}[\log D(c_i)]
+ \mathbb{E}_{e_j \sim S_{data}(e)}[\log(1 - D(e_j))]
+ \mathbb{E}_{p_k \sim S_{data}(p)}[\log(1 - D(G(p_k)))]$$

manually as

`torch.log(torch.abs(D(...))) + torch.log(torch.abs(1 - D(...))) + torch.log(torch.abs(1 - D(...)))`
As the discriminator output sometimes contains negative values, calling `log()` directly on them causes an error. Therefore I wrapped `abs()` around the input of `log()`.
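Putting the three terms together, my first manual version of the discriminator loss looked roughly like this (a sketch; variable names are placeholders, and the reduction to a single value via `mean()` is my assumption):

```python
import torch

def d_adversarial_loss(D, G, cartoon, smoothed_cartoon, photo):
    # abs() guards against negative discriminator outputs before log()
    real_term = torch.log(torch.abs(D(cartoon)))
    edge_term = torch.log(torch.abs(1 - D(smoothed_cartoon)))
    fake_term = torch.log(torch.abs(1 - D(G(photo))))
    # reduce the batch_size x 1 x 64 x 64 patch outputs to a single value
    return (real_term + edge_term + fake_term).mean()
```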
As my training results weren't as expected, I came back to the loss functions. The adversarial loss should output a probability, thus a single value, but my discriminator outputs a tensor of shape `batch_size x 1 x 64 x 64`. This brought me to `BCEWithLogitsLoss`, which combines activation function and loss.
But which activation function to use?
As the discriminator should give a probability and only has two classes as outputs, `real` or `fake`, using sigmoid or softmax is a good choice. Softmax can be used for binary classification as well as for classification of $n$ classes.
First, I decided to use a loss function which combines activation and loss function, and this gave me the choice between:

- `BCEWithLogitsLoss`: sigmoid and binary cross entropy loss
- `CrossEntropyLoss`: softmax and negative log likelihood loss

For solving a minimax problem, which loss should I choose?
"If [minimax] implemented directly, this would require changes be made to model weights using stochastic ascent rather than stochastic descent. It is more commonly implemented as a traditional binary classification problem with labels 0 and 1 for generated and real images respectively."
Therefore I chose `BCEWithLogitsLoss`.
As `BCEWithLogitsLoss` has two parameters, one for the input and one for the target, I used `BCEWithLogitsLoss` three times, once for every different input, and added the values up.
But after trying this solution, the generator produced values lower than zero, which led to problems when mapping these values to RGB. Therefore I decided not to combine activation and loss function, but to use sigmoid directly in the generator as well as in the discriminator, with `BCELoss` as the loss function.
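A sketch of the resulting setup with `BCELoss` (it assumes `D` ends with a sigmoid so its output is already in $[0, 1]$; the label tensors and the `detach()` on the generated images are common GAN practice, not something prescribed by the paper):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def d_loss(D, G, cartoon, smoothed_cartoon, photo):
    # real cartoons should be classified as 1, edge-smoothed and generated images as 0
    real = D(cartoon)
    edge = D(smoothed_cartoon)
    fake = D(G(photo).detach())   # detach: no generator gradients while training D
    return (bce(real, torch.ones_like(real))
            + bce(edge, torch.zeros_like(edge))
            + bce(fake, torch.zeros_like(fake)))
```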
Initially, I set $\omega$, which is a weight to balance the style and the content preservation, to the value given in the paper, which is 10. After running 210 epochs, the content preservation was very good, but the generated images did not have a cartoon style. Maybe this is a problem with my input data, where I use different cartoon styles from different artists instead of the style of one single artist, as used in the paper.
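For reference, this is how I understand $\omega$ entering the total generator loss (a sketch with my own names; the adversarial and content parts are computed as described above):

```python
omega = 10  # value from the paper; later training rounds use much smaller values

def g_total_loss(g_adversarial_loss, g_content_loss):
    # omega balances cartoon style (adversarial part) against content preservation
    return g_adversarial_loss + omega * g_content_loss
```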
Example results (photo input vs. generated image):

- directly after the start, one of the first epochs
- directly after the init phase is completed
- directly at the beginning of epoch 11, using the full generator loss instead of the init loss (these results seem to be outliers at this stage of training, as the next outputs look more similar to the inputs)
- after training has finished 210 epochs
So the content loss is orders of magnitude higher than the adversarial loss.

Maybe my calculation of the content loss is wrong? Should it be much lower? As the generated images preserve the content very well, I concentrate on the adversarial loss.

Maybe I use the wrong VGG model? Content preservation is not the problem, therefore I concentrate on the comic style.
As the adversarial loss is responsible for the comic effect, I try a much lower $\omega$ for the next training round, to balance the values of `g_content_loss` and `g_adversarial_loss` on an equal level. As `g_content_loss` has values around $4e+5$, I choose $\omega = 0.00001$. After 210 epochs, the result is the following:
The optimization of the content loss falls a little behind in comparison to training round 1, where $\omega$ was much higher and therefore played a bigger role in the total loss. On the other hand, the optimization of the adversarial loss gets better, because it now plays a bigger role in the total loss.
As seen in the example images, the comic effect starts to kick in and the content is still preserved.
As a next trial, I set $\omega = 0$ to have the optimization effect on the adversarial loss only. With $\omega = 0$, the trained model loses the content information, but the comic style comes through:
(The content loss is not part of the total loss here, because $\omega$ is set to zero, but it is plotted for visualization.)
As the comic style starts to come through around $\omega = 0.00001$ and fully dominates at $\omega = 0$, I try an $\omega$ between these values for the next round.
Set $\omega = 0.00001 / 2 = 0.000005$
Unfortunately, I did not clean up my TensorBoard results folder before re-training, so the results are plotted on top of the existing ones from round 3. I tried to make this a little clearer by manual image editing. The green plot is the old one from round 3, the orange plot is the current one from round 4.
Not every image is transformed as desired, but the results are there :-)
The paper does not mention which optimizer was used, so I decided to choose Adam. For hyperparameter tuning, I went with the same parameters as mentioned in the DCGAN paper [13].
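In code, that amounts to the following optimizer setup (learning rate 0.0002 and $\beta_1 = 0.5$ as suggested in [13]; `generator` and `discriminator` are placeholders for my model instances):

```python
import torch.optim as optim

lr = 0.0002            # learning rate from the DCGAN paper [13]
betas = (0.5, 0.999)   # beta_1 = 0.5 as in the DCGAN paper, beta_2 at its default

g_optimizer = optim.Adam(generator.parameters(), lr=lr, betas=betas)
d_optimizer = optim.Adam(discriminator.parameters(), lr=lr, betas=betas)
```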
- Put `generator_release.pth.zip` into the root path of this project
- Run `make install-transform`
- Run `make transform IMAGE=path_to_image`, e.g. `make transform IMAGE=~/Pictures/photo.jpg`
- `transformed.JPG` is created
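Conceptually, the transform step does roughly the following (a hypothetical sketch, not the project's actual script: the unzipped weight file name, the `Generator` import, and the pre-processing are assumptions):

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image

from network import Generator  # assumed module and class name

generator = Generator()
generator.load_state_dict(torch.load("generator_release.pth", map_location="cpu"))
generator.eval()

preprocess = transforms.Compose([transforms.CenterCrop(256), transforms.ToTensor()])
photo = preprocess(Image.open("path_to_image").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    cartoon = generator(photo)

save_image(cartoon, "transformed.JPG")
```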