Gone are the days when small businesses and startups had to hire models and celebrities and arrange a photoshoot with their products to advertise and publicize them. With the advent of new technologies like GANs, we have seen leaps in ad creation. However, a subset of the problem remained untouched until today.
Until now, no model was capable of generating brand new, full-sized, never-seen-before humans and compositing their images with that of a product.
Cue AdGen. With AdGen, you can create new human models in poses of your choice and place a given product (a handbag, in this case) so that it looks like the human model is using or holding the product. The video below introduces AdGen in 30 seconds.
Finding realistic geometric corrections to a foreground object, so that it appears natural when composited into a background image, is a problem that has already been addressed. We have seen this applied to placing glasses on human faces and furniture in a room. However, extending the problem to product placement on full-sized humans (and not just faces) poses a different set of issues altogether.
In this post we discuss the technical details of our solution to the two-part problem posed here. How do we generate brand new full-sized humans in specific poses? And how do we superimpose the bags at the correct angle, size, and position on this newly generated person?
Problem 1: Human Image Generation
Synthesizing images of a person in arbitrary poses, based on an image of that person and a novel pose, is doable. However, novel human image generation is still an unresolved problem in Deep Learning. StyleGAN2 + GFLA and Disentanglement Person Image Generation are two very different methods aimed at generating brand new humans in specified poses.
Approach 1: StyleGAN2 + GFLA
In this approach, we generate a new human image using StyleGAN2, then feed the output to the GFLA method along with the target pose to get a human image in that pose, as shown in the image.
To train StyleGAN2, we used the DeepFashion dataset, which contains around 79k images of size 128x128, with a batch size of 7 for 40,000 training steps. For GFLA, we used the model pre-trained on the DeepFashion dataset to generate the human models in their new poses.
Approach 2: Disentanglement Person Image Generation
In this approach, we give the Disentanglement Person Image Generation method the target pose and sample an appearance in that pose. We keep the poses fixed while sampling appearances, so that it generates new humans in those fixed poses. The main advantage of disentanglement is that it offers more control over image generation from noise.
Approach 1 vs. Approach 2 Results
Qualitatively, both methods give us good results for the same input poses, as shown below. They successfully generate brand new, never-seen-before humans in that particular input pose!
Quantitatively, both methods have a low Inception score. The output images look like humans for sure, but if you were to zoom in, you would find some features merging into one another. We also calculated the MSE loss between the input pose points and the output pose points, and got numbers close to 0 for both methods.
Having analyzed the outputs of both approaches, we have come to the conclusion that neither of them is state of the art. However, both networks generate really good human images for our intents and purposes. We plan on cherry-picking the best images from both approaches and feeding them as input to the next stage, ST-GAN.
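As a rough illustration of the pose check described above, here is a minimal sketch of an MSE over matching keypoints. The post does not say how the keypoints were extracted or scaled, so the normalized (x, y) coordinates and the function name `pose_mse` are assumptions made for this example only.

```python
def pose_mse(target, predicted):
    """Mean squared error over matching (x, y) keypoints.

    Both arguments are lists of (x, y) tuples in the same joint order.
    The coordinate convention (normalized to [0, 1]) is an assumption.
    """
    if len(target) != len(predicted):
        raise ValueError("keypoint lists must have the same length")
    if not target:
        raise ValueError("at least one keypoint is required")
    squared = [
        (tx - px) ** 2 + (ty - py) ** 2
        for (tx, ty), (px, py) in zip(target, predicted)
    ]
    # Average over every coordinate (two per keypoint).
    return sum(squared) / (2 * len(target))


# Hypothetical keypoints: the generated pose is slightly off at two joints.
target = [(0.50, 0.10), (0.50, 0.30), (0.40, 0.55)]
predicted = [(0.50, 0.10), (0.52, 0.30), (0.40, 0.53)]
print(pose_mse(target, predicted))
```

A value near 0 means the generated person follows the requested pose closely. It says nothing about image quality, which is why the Inception score is reported separately.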
Problem 2: Adding Bags to the models
Having generated the models to pose with our products, we now needed to combine the two images into one.
The ST-GAN takes in a background (human image) and foreground (handbag) image as input and tries to find geometric corrections to the foreground object such that it appears realistic.
Experiments with ST-GAN
ST-GAN is known to work well for compositing eyeglasses onto images of human faces and for finding appropriate furniture placement in a room. We needed it to combine images of full-sized humans and handbags. We carried out at least 10 different experiments, three of which we present in this blog.
In ST-GAN the Generator outputs a set of geometric parameters (p) which are used to apply transformations to the foreground (bag) image so that it looks realistic in the background image. The Discriminator tries to classify if the image is real or fake (generated).
We optimize the WGAN objective with gradient penalty. The Discriminator and Generator loss functions are given below.
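As a reference point, the WGAN-GP objective in the ST-GAN paper takes roughly the following form. This is a sketch under assumptions, not necessarily the exact expression used in this project: the weights λ_grad and λ_update, the interpolated sample x̂ (drawn on straight lines between real and composite images), and the penalty on the warp update Δp are taken from the ST-GAN and WGAN-GP papers, and their values here are not stated.

```latex
\begin{aligned}
\mathcal{L}_D &= \mathbb{E}_{x}\big[D(x)\big] - \mathbb{E}_{y}\big[D(y)\big]
  + \lambda_{\text{grad}}\,\mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big] \\
\mathcal{L}_G &= -\,\mathbb{E}_{x}\big[D(x)\big] + \lambda_{\text{update}}\,\lVert \Delta p \rVert_2^2
\end{aligned}
```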
Here, x represents the composite image obtained by applying the parameters p produced by the Generator to the input bag image, and y represents the real images of people with bags.
We train the ST-GAN in 2 phases. In the first phase, we train the Discriminator alone for 50,000 iterations by compositing the bag image to the center of the background image with random initial parameters p₀. In the next phase, we train both the Discriminator and a set of Generators iteratively. In each iteration, we use a new Generator which is conditioned on the output of the previous Generator (p(t-1)), while keeping the Discriminator the same throughout each iteration. The same loss function was used for all 3 experiments as we found it to work the best for this use case.
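The iterative scheme above can be sketched in a few lines. This is an illustration only: the six-number affine layout, the additive update (the ST-GAN paper composes warps rather than adding parameters), and `toy_generator`, a hypothetical stand-in for a trained Generator, are all assumptions of this example.

```python
IDENTITY = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0)


def apply_affine(p, point):
    """Map (x, y) through affine parameters p = (a, b, tx, c, d, ty)."""
    a, b, tx, c, d, ty = p
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def iterative_warp(p0, generators):
    """Refine warp parameters step by step.

    Each generator sees the previous estimate and returns a residual
    update, which is added to it (a simplification of warp composition).
    Returns the full history so every intermediate warp can be rendered.
    """
    p = list(p0)
    history = [tuple(p)]
    for generator in generators:
        delta = generator(tuple(p))
        p = [value + step for value, step in zip(p, delta)]
        history.append(tuple(p))
    return history


def toy_generator(p):
    # Hypothetical stand-in for a trained network: move halfway
    # towards a fixed "good" placement of the bag.
    goal = (0.5, 0.0, 0.25, 0.0, 0.5, -0.125)
    return tuple((g - v) * 0.5 for g, v in zip(goal, p))


# Five warps, one generator per step.
history = iterative_warp(IDENTITY, [toy_generator] * 5)
for step, p in enumerate(history):
    # Track where the bag's top-right corner lands after each warp.
    print(step, apply_affine(p, (1.0, 1.0)))
```

Keeping the whole history, rather than only the final parameters, is what makes it possible to visualize the bag moving into place warp by warp.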
In the first experiment, we resized the input images to match the network architecture of the original ST-GAN paper and used the same transforms proposed in it. This served as a very good proof of concept that unpaired data could work with the new DeepFashion dataset and the images crawled from the web. This experiment gave some good results; however, a lot of bags still blew up, as can be seen in the bad results row.
On resizing the images as in Experiment 1, we found that interpolation methods such as bicubic interpolation smoothed the images a lot, which reduced the quality of the input images. The quality of the images in the real dataset was better than that of the fake images our Generator was outputting. We inferred that some bags were blowing up because the crawled images of humans with bags had non-white backgrounds. Even the ones with a white background looked a little different from the DeepFashion dataset, which was curated and pre-processed by the original authors.
Thus, in the second experiment, we modified the Discriminator and Generator networks to support 128 x 128 input images, which is the original DeepFashion input size. Additionally, we added normalization to our Discriminator network (the original ST-GAN paper does not have any Norm layers). We could not add Batch Normalization, as it does not work well with the WGAN objective with gradient penalty, as explained in the WGAN-GP paper (Gulrajani et al.). Therefore, we decided to go with Layer Normalization, as suggested in the same paper.
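The reason usually given for this choice is that the gradient penalty is defined per input, while Batch Normalization makes each output depend on the other samples in the batch. The toy sketch below shows that difference on flat feature vectors. It omits the learned gain and bias, and it is not the network code from this project.

```python
from statistics import fmean, pvariance


def layer_norm(features, eps=1e-5):
    """Normalize one sample using only its own mean and variance."""
    mean = fmean(features)
    std = (pvariance(features, mu=mean) + eps) ** 0.5
    return [(value - mean) / std for value in features]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature position across all samples in the batch."""
    columns = list(zip(*batch))
    means = [fmean(column) for column in columns]
    stds = [
        (pvariance(column, mu=mean) + eps) ** 0.5
        for column, mean in zip(columns, means)
    ]
    return [
        [(value - mean) / std for value, mean, std in zip(sample, means, stds)]
        for sample in batch
    ]


sample = [1.0, 2.0, 4.0]
batch_a = [sample, [0.0, 0.0, 0.0]]
batch_b = [sample, [9.0, 9.0, 9.0]]

# Layer Norm: the sample's output ignores its batch-mates.
print([layer_norm(s) for s in batch_a][0] == [layer_norm(s) for s in batch_b][0])
# Batch Norm: the same sample comes out differently in a different batch.
print(batch_norm(batch_a)[0] == batch_norm(batch_b)[0])
```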
The GIFs actually indicate how the model successively applies affine transforms like rotation, positioning, tilting, etc. to fit at the correct location on the shoulder or palm. The updates are from each warp or iteration and we used 5 of those.
We obtained considerably better results than Experiment 1.
At this stage, the outputs looked pretty good with white backgrounds, but some bags still blew up. Thus, in experiment 3 we used the same architecture as experiment 2 but trained on scraped images without backgrounds.
This experiment gave us the best results. It resolved the exploding bag problem and even made our model generalize better. The first and second rows in the results below show each warp iteration on random stock images taken from the internet.
The third row shows an even more diverse set of results. Some of the images have backgrounds, and the model still placed the bags in the correct position. This tells us that the model from this experiment generalizes well.
Comparison of all the experiments
The table succinctly describes the datasets and the changes done in each experiment.
Experiment 1 uses the original ST-GAN architecture without any Norm layers in the Discriminator. We can clearly see from the loss curve that the Discriminator is too strong and the loss has a lot of variance. In Experiment 2, after adding Layer Norm to the Discriminator, it is weaker and learns along with the Generator. The Generator loss also gets closer to 0 than in Experiment 1.
We did do a lot more experiments with different hyper-parameters and GAN Objectives. You can find them here.
We used the DeepFashion Fashion Synthesis dataset for training DCGAN, the GFLA method, and StyleGAN2. To train the ST-GAN to generate humans with bags, we required images of models holding bags. No ready-to-use dataset was available for this task. Therefore, we curated our own dataset of images of humans holding or wearing bags.
To superimpose our product, the handbags, on the image of the newly generated human, we needed to cherry-pick images of bags from the web. We also had to ensure that all the bags were in .png format and had the 3 usual RGB channels along with an extra alpha channel to be used as a mask.
We passed all 27,156 images through the TensorFlow Object Detection classifier to filter out every image that did not contain both a full human body and a bag. After filtering, we had 7,064 images satisfying our requirements to train ST-GAN.
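To show why the alpha channel matters, here is a minimal sketch of alpha compositing on plain nested lists of pixels. A real pipeline would use an image library and arrays; the function name `composite` and the pixel layout (rows of tuples with values from 0 to 255) are assumptions for illustration.

```python
def composite(background, bag_rgba, top, left):
    """Alpha-blend an RGBA foreground onto an RGB background.

    background: rows of (r, g, b) tuples; bag_rgba: rows of (r, g, b, a)
    tuples. (top, left) is where the bag's first pixel lands. Returns a
    new image and leaves the background untouched.
    """
    result = [row[:] for row in background]
    height, width = len(result), len(result[0])
    for dy, bag_row in enumerate(bag_rgba):
        for dx, (r, g, b, a) in enumerate(bag_row):
            y, x = top + dy, left + dx
            if not (0 <= y < height and 0 <= x < width):
                continue  # bag pixel falls outside the background
            alpha = a / 255
            result[y][x] = tuple(
                round(alpha * fg + (1 - alpha) * bg)
                for fg, bg in zip((r, g, b), result[y][x])
            )
    return result


white = (255, 255, 255)
background = [[white, white], [white, white]]
# One opaque red pixel next to one fully transparent pixel.
bag = [[(255, 0, 0, 255), (0, 0, 0, 0)]]
print(composite(background, bag, top=0, left=0))
```

Wherever alpha is 0 the background shows through unchanged, so only the bag itself, and not the rectangle around it, ends up in the composite.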
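The filtering rule can be sketched as follows. The post does not give the detector's labels or thresholds, so everything here is a labeled assumption: detections are shown as dictionaries with a `label`, a `score`, and a normalized `(ymin, xmin, ymax, xmax)` box, the class names `person` and `handbag` follow COCO naming, and the height threshold used as a proxy for "full body" is hypothetical.

```python
def keep_image(detections, min_score=0.5, min_person_height=0.8):
    """Keep an image only if it shows a (roughly) full-height person and a bag.

    min_score and min_person_height are hypothetical thresholds; the
    box height check is a simple stand-in for a "full human body" test.
    """
    has_full_person = any(
        d["label"] == "person"
        and d["score"] >= min_score
        and (d["box"][2] - d["box"][0]) >= min_person_height
        for d in detections
    )
    has_bag = any(
        d["label"] == "handbag" and d["score"] >= min_score
        for d in detections
    )
    return has_full_person and has_bag


# Hypothetical detector output for one image.
person = {"label": "person", "score": 0.9, "box": (0.05, 0.30, 0.98, 0.70)}
handbag = {"label": "handbag", "score": 0.8, "box": (0.50, 0.60, 0.70, 0.80)}
print(keep_image([person, handbag]))
```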
The code for this project can be found here.
As shown in the video below, the model is capable of generating new humans and compositing the newly generated human images with bag images. In the 2 videos below, we show the step-by-step output generated from the StyleGAN2 + GFLA approach.
The images below show some outputs we obtained from both Approach 1 and Approach 2. It is easily observable that both give similar results.
The model architectures described in the StyleGAN2, GFLA, disentanglement, and ST-GAN papers served as good frameworks for this use case. However, none of them could single-handedly solve the problem we discussed. No pre-existing state-of-the-art model is capable of generating full-sized humans and compositing that image with a product image. By training StyleGAN2 on the DeepFashion dataset and passing the new human through the GFLA model to generate an image of that human in the given pose, we were able to solve the first problem. Disentanglement served as another method for solving the same problem. It was immensely important to have a human generated in the given pose, because this offers advertisers more control when creating their ad posters. Not every pose works for every type of advertisement: different products and different photography angles have varied needs.
ST-GAN served as a good framework for part 2 of the problem. The architecture seemed to work for the most part but we faced a lot of issues in interfacing the GAN with the new datasets. Progressively, we experimented with different image transformations, input image sizes, adding different layers to the discriminator network, changing strides, padding and dilation to handle different input image sizes, and finally, the whole dataset itself to make it more generalizable. We tried it on one of our team members’ pictures as well, and guess what, it worked!
This model is by no means state of the art, but more importantly, it serves as a proof of concept that something like this could work. This technology is here to disrupt the 1.2 trillion dollar advertising industry. In the times of Covid, when local businesses are suffering the most, AdGen is a light at the end of the tunnel that lets them advertise their products without having to shell out a single penny.
On this year's Small Business Saturday, we, Tensor Heads, are proud to serve the needs of the businesses that form the backbone of the global economy, a need of the hour also recognized by President-Elect Joe Biden and Vice President-Elect Kamala Harris.
The work described above uses a combination of GANs to create a brand new person image and ST-GAN to place the bag image on the human. Due to time constraints, we could only run a limited number of experiments. For future work, we would like to try SF-GAN instead of ST-GAN. SF-GAN uses separate discriminators for geometric realism and for appearance realism. The paper did not have enough details for us to implement this method in the limited time we had. This method might improve how the bags look on the human model.
 Ren, Y., Yu, X., Chen, J., Li, T. H., & Li, G. (2020). Deep Image Spatial Transformation for Person Image Generation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr42600.2020.00771
 F. Zhan, H. Zhu, and S. Lu, “Spatial Fusion GAN for Image Synthesis,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
 K. Koidan, “8 AI Companies Generating Creative Advertising Content,” 23-Aug-2020. [Online]. Available: https://www.topbots.com/ai-companiesgenerating-creative-advertising-content/. [Accessed: 06-Oct-2020]
 Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele and Mario Fritz, “Disentangled Person Image Generation” Available: https://homes.esat.kuleuven.be/ liqianma/pdf/CVPR18_Ma_Disentangled_Person_Image_Generation.pdf
 Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman and Simon Lucey “ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing” Available: https://arxiv.org/pdf/1803.01837.pdf
 Tero Karras, Samuli Laine, Miika Aittala and Janne Hellsten “Analyzing and Improving the Image Quality of StyleGAN” Available: https://arxiv.org/pdf/1912.04958.pdf
 Alec Radford, Luke Metz and Soumith Chintala “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” Available: https://arxiv.org/pdf/1511.06434.pdf
 Pang, B., & Xiu, Y. (2017, October 08). CS348 Computer Vision. Retrieved December 01, 2020, from https://www.mvig.org/research/alphapose.html
 Ildoonet. (n.d.). Ildoonet/tf-pose-estimation. Retrieved December 01, 2020, from https://github.com/ildoonet/tf-pose-estimation
 Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville “Improved Training of Wasserstein GANs” Available: https://arxiv.org/pdf/1704.00028.pdf
 Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, Luc Van Gool “Pose Guided Person Image Generation” Available: https://arxiv.org/pdf/1705.09368.pdf
 T. Mutunhire, “The Advertising Industry Is Now Worth $1.2 Trillion, As Marketing in its various forms continues to grow through mobile, content marketing, social platforms, and new digital platforms,” Towers of Zeyron, 02-Dec-2017. [Online]. Available: https://towersofzeyron.com/theadvertising-industry-is-now-worth-1-2-trillion-marketing-in-its-various-forms-continues-togrow-through-mobile-content-marketing-social-platforms-and-new-digital-platforms/. [Accessed: 06-Oct-2020]