[1907.01710] Mask Embedding in conditional GAN for Guided Synthesis of High Resolution Images
Our experiments are based on semantic input; the same concept applies to other conditional inputs such as textures and text.
Abstract: Recent advancements in conditional Generative Adversarial Networks (cGANs)
have shown promise in label-guided image synthesis. Semantic masks, such as
sketches and label maps, are another intuitive and effective form of guidance
in image synthesis. However, directly incorporating semantic masks as constraints
dramatically reduces the variability and quality of the synthesized results. We
observe that this is caused by the incompatibility of features from the different inputs
(such as the mask image and the latent vector) of the generator. To use semantic masks
as guidance while still producing realistic synthesized results with fine details,
we propose a mask embedding mechanism that allows for a more efficient
initial feature projection in the generator. We validate the effectiveness of
our approach by training a mask-guided face generator on the CelebA-HQ dataset.
We can generate realistic, high-resolution facial images up to a
resolution of 512×512 under mask guidance. Our code is publicly available.
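The core idea, as described above, is to embed the mask into a compact vector and fuse it with the latent vector before the initial feature projection, rather than imposing the mask only as a pixel-level constraint. The sketch below illustrates that fusion step in plain NumPy; all layer shapes, the single-linear-map "encoder", and the function name are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_embedding_projection(mask, z, emb_dim=64, base=(4, 4, 512)):
    """Hypothetical sketch of mask embedding: encode a binary semantic
    mask into a compact embedding, concatenate it with the latent vector,
    and project the joint vector into the generator's initial (base)
    feature map, which the convolutional up-sampling stack would refine."""
    h, w, c = base
    # Toy mask encoder: one random linear map standing in for a learned
    # convolutional mask encoder.
    W_enc = rng.standard_normal((mask.size, emb_dim)) * 0.01
    m_emb = mask.reshape(-1) @ W_enc                    # (emb_dim,)
    # Joint input: latent vector conditioned on the mask embedding.
    joint = np.concatenate([z, m_emb])                  # (len(z) + emb_dim,)
    # Initial projection to the base feature map.
    W_proj = rng.standard_normal((joint.size, h * w * c)) * 0.01
    return (joint @ W_proj).reshape(h, w, c)

z = rng.standard_normal(128)           # latent vector
mask = rng.integers(0, 2, (32, 32))    # toy 32x32 binary semantic mask
feat = mask_embedding_projection(mask, z)
print(feat.shape)  # (4, 4, 512)
```

Because the mask influences the very first projection, the base features can already lie near the correct manifold for that mask, instead of the network having to warp an "average" base image into compliance later in the convolution stack.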
Figure 3 (Introduction): Architecture of our network. Left: a U-Net style generator. Right: a discriminator consisting of several convolution layers.

Figure 4 (Pixel-Level Mask Constraint and Model Design): An illustrative example of generating an image of a dog using a dog mask as guidance. Left: illustrative feature space of the image generation process using a series of convolution layers. Right: two examples of generating a dog image given a dog mask, with and without mask embedding. For simplicity, the latent features are visualized as low-resolution images. At inference time, an ideal model with mask embedding projects base features onto the correct manifold and performs proper up-sampling through the convolution layers; a model without mask embedding instead learns to (1) project only an average base image, and (2) inefficiently map the average base image to a dog to comply with the mask constraint.

Figure 5 (Formulation): (a) Input mask. (b) Synthesized image using Pix2Pix. (c) Synthesized image using our without-embedding baseline model. (d) Synthesized image using our proposed embedding model.

Figure 6 (Changing Latent Input): (a) Input mask. (b) Original image. (c), (d), (e) Synthesized images using the same mask but different latent vectors.