
Edge-to-Image Reconstruction using Pix2Pix Conditional GAN

Abstract

In our project, we investigate conditional adversarial networks, which have been used in a variety of image-to-image translation tasks for quite some time now. These networks not only learn the mapping from the input image to the output image but also learn the loss function to train this mapping. We've thus been able to apply a similar approach to a wide variety of tasks with different loss functions. Most problems in image processing, computer graphics & computer vision can be posed as translating an input image into a corresponding output image. These tasks include synthesizing photos from labels, converting aerial images to maps, and colorizing images. In this project, we solely focus on one specific application: Pix2Pix, i.e., the reconstruction of objects from edge maps.

1. Introduction


Automatic image-to-image translation can be defined as the problem of translating one possible representation of a scene into another, given sufficient training data. All of these tasks pose a similar setting: prediction of pixels from pixels. In Pix2Pix, we aim to generate the image of an object given its input edge map.

The context and background for the proposed work is based upon the 2017 CVPR paper Image-to-Image Translation with Conditional Adversarial Networks [1]. That paper implements this task on two specific datasets of shoes and handbags. Our contribution is to apply a similar approach to a completely new dataset of fruits [4] obtained from an open-source collection. The training dataset containing the combined images of edge map (A) and object (B), however, is built from scratch.

A significant amount of work has already been carried out in this direction, using Convolutional Neural Networks (CNNs) as a common solution for a wide range of image prediction problems. The main drawback of using a CNN is that designing effective losses takes a lot of manual effort, even though the learning process is automatic. We also need to specify the loss function that the CNN has to minimize; if we do not, the CNN might take a naive approach and produce disfigured images. Sharp, realistic images need accurate loss functions, and predicting these requires considerable expert knowledge.

We therefore need a network that automatically learns an appropriate loss function to make the output indistinguishable from reality. This is exactly what Generative Adversarial Networks (GANs) do. Using a generator G, GANs produce an output image, which is then classified by the discriminator as real or fake. This largely eliminates the possibility of producing a blurry output [5, 6], since such an output would easily be classified as fake. In our project we direct our focus on Conditional Generative Adversarial Networks (CGANs), where we condition on an input image to produce a corresponding output image.

2. Background/Related Work

Prior and concurrent works have already applied conditioned GANs on text [7] and discrete labels [8, 9, 10] in addition to images. These include prediction of an image from a normal map [11], product photo generation [12], and future frame prediction [13]. Several other papers that used GANs unconditionally have also achieved notable results on inpainting [5], style transfer [14], future state prediction [15], and super-resolution [16].

Our base paper [1] uses a "U-Net" based architecture [17] for the generator and a convolutional "PatchGAN" classifier, which only penalizes structure at the scale of image patches, for the discriminator. A similar PatchGAN architecture [14] that was previously proposed can also be used to address a wide range of problems by changing the patch size; a rough sketch of such a patch-level discriminator is given at the end of this section.

Even though these problems map pixels to pixels in general, they are often treated with application-specific algorithms. In the paper [1], Conditional Adversarial Networks (CGANs) are used as a general-purpose solution on several tasks such as: Labels to Street Scene, Labels to Facade, Black and White to Color, Aerial to Map, and Day to Night.

Another interesting application of Pix2Pix has been made by Christopher Hesse [18] in his Image-to-Image demo of cats. Around 2k stock photos and edges automatically generated from those photos were used for training.
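To make the PatchGAN idea mentioned above concrete, the following is a minimal PyTorch sketch of a patch-level discriminator. It is illustrative only: the layer widths and the three-block configuration (a roughly 70x70 receptive field) follow the common pix2pix setup, but the exact architecture used in [1] and in our implementation may differ in details.

import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Illustrative PatchGAN: classifies each patch of the (edge, photo) pair as real or fake."""
    def __init__(self, in_channels=6, base=64):  # 6 = edge map (3) + photo (3), concatenated
        super().__init__()
        def block(cin, cout, stride):
            return [nn.Conv2d(cin, cout, 4, stride, 1), nn.BatchNorm2d(cout), nn.LeakyReLU(0.2, inplace=True)]
        layers = [nn.Conv2d(in_channels, base, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True)]
        layers += block(base, base * 2, 2)
        layers += block(base * 2, base * 4, 2)
        layers += block(base * 4, base * 8, 1)
        layers += [nn.Conv2d(base * 8, 1, 4, 1, 1)]  # one logit per patch, not per image
        self.net = nn.Sequential(*layers)

    def forward(self, edge, photo):
        # Condition on the input by concatenating edge map and photo along the channel axis.
        return self.net(torch.cat([edge, photo], dim=1))

# Example: a 256x256 input pair yields a 30x30 grid of patch decisions.
# d = PatchDiscriminator(); d(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)).shape

Because the output is a grid of per-patch logits rather than a single score, the discriminator only penalizes structure at the patch scale, which is the property the base paper relies on.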

3. Approach
In our approach to implement the Pix2Pix model on an
entirely new dataset, we first need to prepare our training
and test data. The input images thus need to be resized,
combined with their edge maps and then split into train and
test sets.

Figure 1. HED architecture with multiple side outputs

We follow the resizing procedure as implemented in our base paper scripts. For generating the edge maps we make use of the Holistically-Nested Edge Detection (HED) algorithm [2]. HED aims to train and predict edges in an image-to-image fashion, without explicitly modeling structured output; this is what the term "holistic" stands for. It produces progressively refined edge maps as side outputs, and the successively inherited edge maps are more concise, which is what the term "nested" emphasizes. This approach represents an integrated learning of hierarchical features.
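Although our HED post-processing was done in MATLAB, a rough Python equivalent can be sketched with OpenCV's dnn module and the publicly released pretrained HED Caffe model. The file names and mean values below are assumptions taken from the original HED release rather than our exact pipeline, and some OpenCV builds additionally require registering the model's custom crop layer before loading.

import cv2
import numpy as np

# Assumed paths to the pretrained HED Caffe model (not part of this project).
net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "hed_pretrained_bsds.caffemodel")

img = cv2.imread("fruit.jpg")
h, w = img.shape[:2]

# HED expects mean-subtracted BGR input; these means come from the original HED code.
blob = cv2.dnn.blobFromImage(img, scalefactor=1.0, size=(w, h),
                             mean=(104.00698793, 116.66876762, 122.67891434),
                             swapRB=False, crop=False)
net.setInput(blob)
edges = net.forward()[0, 0]                    # fused edge map with values in [0, 1]
edges = (255 * cv2.resize(edges, (w, h))).astype(np.uint8)
cv2.imwrite("fruit_edges.png", 255 - edges)    # invert: dark edges on white, edges2shoes style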

Once the edge maps are generated from our input images, we combine the two together to create our training and test sets. We then train our Pix2Pix model using the newly created dataset.
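A minimal sketch of this combination step is shown below. It assumes the photos and their edge maps live in two parallel folders and share file names; the folder names, the 256x256 target size, and the split sizes are illustrative assumptions rather than our exact scripts.

import os, random
from PIL import Image

PHOTO_DIR, EDGE_DIR, OUT_DIR = "photos", "edges", "combined"  # assumed directory layout
SIZE = 256

names = sorted(os.listdir(PHOTO_DIR))
random.seed(0)
random.shuffle(names)
split = {"test": names[:300], "train": names[300:]}  # illustrative split sizes

for subset, files in split.items():
    os.makedirs(os.path.join(OUT_DIR, subset), exist_ok=True)
    for name in files:
        edge = Image.open(os.path.join(EDGE_DIR, name)).convert("RGB").resize((SIZE, SIZE))
        photo = Image.open(os.path.join(PHOTO_DIR, name)).convert("RGB").resize((SIZE, SIZE))
        pair = Image.new("RGB", (2 * SIZE, SIZE))   # side-by-side A|B image
        pair.paste(edge, (0, 0))                    # A: edge map
        pair.paste(photo, (SIZE, 0))                # B: object photo
        pair.save(os.path.join(OUT_DIR, subset, name))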

An example of an input image combined with its edge map is pictured below:

Figure 2. An input image combined with its edge map

Generative Adversarial Networks (GANs) usually learn a mapping from a random noise vector z to an output image y; in this generative model, G : z → y [19]. Conditional GANs differ slightly in their inputs: they learn a mapping from an observed image x and a random noise vector z to y, G : {x, z} → y. The generator G produces output images that are intended to be indistinguishable from "real" images, while the adversarially trained discriminator D tries its best to distinguish the generator's "fakes" from real {edge, photo} tuples.

A conditional GAN's objective is

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))],

where G minimizes this objective against an adversarial D that tries to maximize it. We also compare this conditional discriminator to an unconditional variant that does not observe x; through this comparison we test the importance of conditioning.

It is beneficial to add a more traditional loss to the GAN objective, such as L2 distance, so that the generator is asked to be near the ground-truth output in addition to fooling the discriminator. Following [1], we use the L1 distance, L_L1(G) = E_{x,y,z}[||y − G(x, z)||_1], which encourages less blurring. The final objective is

G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G).

The noise z here is provided in the form of dropout, which is applied on several layers of the generator G during both training and test time.
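As a rough PyTorch sketch of this combined objective (not the exact code of our implementation), the per-batch losses can be computed as follows. The generator and discriminator modules are assumed to take the conditioned inputs described above, and the weight on the L1 term follows the λ = 100 used in [1]:

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # adversarial loss on the PatchGAN logits
l1 = nn.L1Loss()
lambda_l1 = 100.0              # weight on the L1 term, as in [1]

def discriminator_loss(discriminator, edge, real_photo, fake_photo):
    real_logits = discriminator(edge, real_photo)
    fake_logits = discriminator(edge, fake_photo.detach())   # do not backprop into G here
    return 0.5 * (bce(real_logits, torch.ones_like(real_logits)) +
                  bce(fake_logits, torch.zeros_like(fake_logits)))

def generator_loss(discriminator, edge, real_photo, fake_photo):
    fake_logits = discriminator(edge, fake_photo)
    adv = bce(fake_logits, torch.ones_like(fake_logits))      # fool the conditional D
    return adv + lambda_l1 * l1(fake_photo, real_photo)       # plus closeness to ground truth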

4. Experiment

The proposed model is implemented in PyTorch, and we made use of Google Colab to improve computational efficiency. We first ran our Pix2Pix model on an existing preprocessed dataset of shoes [3] to evaluate its functionality. The 50K training images were obtained from the UT Zappos50K dataset, and the model was trained for 15 epochs with a batch size of 4 (a condensed sketch of this training loop is given after Figure 3).

Figure 3 is an example of an input image given to test our model.

Figure 3.
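The alternating training loop under these settings (15 epochs, batch size 4) can be sketched as follows. The dataset, generator, and discriminator objects and the loss helpers from the previous sketch are assumed, and the Adam hyperparameters are the common pix2pix defaults (lr = 2e-4, betas = (0.5, 0.999)) rather than necessarily our exact configuration:

import torch
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=4, shuffle=True)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

for epoch in range(15):
    for edge, photo in loader:                 # each A|B pair split into (edge, photo) tensors
        fake = generator(edge)                 # dropout inside G provides the noise z

        opt_d.zero_grad()                      # update D on real vs. detached fake
        discriminator_loss(discriminator, edge, photo, fake).backward()
        opt_d.step()

        opt_g.zero_grad()                      # update G: fool D and stay near ground truth
        generator_loss(discriminator, edge, photo, fake).backward()
        opt_g.step()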

Figure 4 is the output image generated by our Pix2Pix model corresponding to the edge map given in Figure 3.

Figure 4.

Figure 5 is the actual ground truth (real) image corresponding to the edge map given as input.

Figure 5.

From the results obtained above, we could verify that our Pix2Pix model was working well on the dataset proposed in the base paper. To extend the implementation to our new dataset of fruits [4], the stock images were obtained from an open-source Kaggle dataset, resized, and then combined with their edge maps. The data was randomly split into about 5k training images and 300 test images. After appropriate HED post-processing using MATLAB and training the model, the results obtained were as follows:

Figure 6.

Figure 6 is an example of an input fruit image given to test our model.

Figure 7.

Figure 7 is the output fruit image generated by our Pix2Pix model corresponding to the edge map given in Figure 6.

Figure 8.

Figure 8 is the actual ground truth (real) fruit image corresponding to the edge map given as input.

We now see that the Pix2Pix model proposed in our base paper can indeed be extended to a customized dataset of our choice. Even though the images look close to perfect to the human eye, there are multiple evaluation techniques that can be applied to verify this, such as the ones mentioned below.

Evaluating GAN outputs is difficult, and there are many different ways of doing it. One strategy is human scoring: real images and images created by the Pix2Pix model are randomly stacked together and judged as real or fake by human scorers who look at each image for a fixed period of time.
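As an illustration of this protocol (not a tool we actually built), the presentation order for such a study can be randomized so that scorers cannot infer from position which images are generated:

import random

def build_scoring_sheet(real_paths, fake_paths, seed=0):
    """Return a shuffled list of (image_path, is_real) pairs for blind human scoring."""
    items = [(p, True) for p in real_paths] + [(p, False) for p in fake_paths]
    random.Random(seed).shuffle(items)
    return items

# Scorers see only the images in this order; the hidden is_real flags are compared
# against their real/fake judgements afterwards to compute a fooling rate.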

5. Conclusion

Image-to-image translation is a challenging problem and often requires specialized models and loss functions for a given translation task or dataset. The results in this paper suggest that conditional adversarial networks give notable results for many image-to-image translation tasks. These networks learn a loss adapted to the task and data at hand, which makes them applicable in a wide variety of settings.

References

[1] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.

[2] S. Xie and Z. Tu. Holistically-nested edge detection. ICCV, 2015.

[3] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. CVPR, 2014.

[4] H. Muresan and M. Oltean. Fruit recognition from images using deep learning. Acta Universitatis Sapientiae, Informatica, 10(1), 2018.

[5] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. CVPR, 2016.

[6] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. ECCV, 2016.

[7] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.

[8] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. NIPS, pages 1486–1494, 2015.

[9] J. Gauthier. Conditional generative adversarial nets for convolutional face generation. Class project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester, 2014(5):2, 2014.

[10] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[11] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. ECCV, 2016.

[12] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. ECCV, 2016.

[13] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. ICLR, 2016.

[14] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. ECCV, 2016.

[15] Y. Zhou and T. L. Berg. Learning temporal transformations from time-lapse videos. ECCV, 2016.

[16] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.

[17] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. MICCAI, pages 234–241. Springer, 2015.

[18] C. Hesse. Interactive image translation with pix2pix-tensorflow. https://affinelayer.com/pixsrv/

[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NIPS, 2014.

[20] A gentle introduction to Pix2Pix generative adversarial network. https://machinelearningmastery.com/a-gentle-introduction-to-pix2pix-generative-adversarial-network/

[21] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21:34–41, 2001.

[22] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.

[23] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013.

[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR, 2015.

[25] D. Marr and E. Hildreth. Theory of edge detection. Proceedings of the Royal Society of London, Series B, Biological Sciences, 207(1167):187–217, 1980.

[26] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 26(5):530–549, 2004.

[27] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[28] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. ICPR, 2012.

[29] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[30] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.

[31] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be Color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG), 35(4), 2016.
