{"cells":[{"cell_type":"markdown","metadata":{},"source":["# [1912.06112v1] Unified Generative Adversarial Networks for Controllable Image-to-Image Translation"]},{"cell_type":"markdown","metadata":{},"source":["Controllable image-to-image translation, i.e., transferring an image from a source domain to a target one guided by controllable structures, has attracted much attention in both academia and industry. In this paper, we propose a unified Generative Adversarial Network (GAN) framework for controllable image-to-image translation. In addition to conditioning on a reference image, we show how the model can generate images conditioned on controllable structures, e.g., class labels, object keypoints, human skeletons and scene semantic maps. The proposed GAN framework consists of a single generator and a discriminator taking a conditional image and the target controllable structure as input. In this way, the conditional image can provide appearance information and the controllable structure can provide the structure information for generating the target result. Moreover, the proposed GAN learns the image-to-image mapping through three novel losses, i.e., color loss, controllable structure guided cycle-consistency loss and controllable structure guided self-identity preserving loss. Note that the proposed color loss handles the issue of “channel pollution” when backpropagating the gradients. In addition, we present the Fréchet ResNet Distance (FRD) to evaluate the quality of generated images. Extensive qualitative and quantitative experiments on two challenging image translation tasks with four different datasets, i.e., hand gesture-to-gesture translation and cross-view image translation, demonstrate that the proposed GAN model generates convincing results, and significantly outperforms other state-of-the-art methods on both tasks. 
Meanwhile, the proposed GAN framework is a unified solution, so it can also be applied to other controllable structure guided image-to-image translation tasks, such as landmark-guided facial expression translation and keypoint-guided person image generation. To the best of our knowledge, we are the first to make one GAN framework work on all such controllable structure guided image translation tasks. The source code, data and trained models are available at https://github.com/Ha0Tang/GestureGAN."]},{"cell_type":"markdown","metadata":{},"source":["Experimental results show that the proposed unified GAN framework achieves competitive performance compared with state-of-the-art methods built on carefully designed, task-specific frameworks on two challenging generative tasks, i.e., hand gesture-to-gesture translation and cross-view image translation."]},{"cell_type":"markdown","metadata":{},"source":["![title](https://a2c.fyi/q/6faL6RrK08/f0.png)"]},{"cell_type":"markdown","metadata":{},"source":["Fig. 1: Comparison with the state-of-the-art image-to-image translation methods. (a) Traditional deep learning methods, e.g., Context Encoder [1]. (b) Adversarial learning methods, e.g., Pix2pix [2] and BicycleGAN [3]. (c) Keypoint-guided image generation methods, e.g., PG2 [4], G2GAN [5] and DPIG [6]. (d) Skeleton-guided image generation methods, e.g., SAMG [7] and PoseGAN [8]. (e) Semantic-guided image generation methods, e.g., SelectionGAN [9] and XFork [10]. (f) Adversarial unsupervised learning methods, e.g., CycleGAN [11], DiscoGAN [12] and DualGAN [13]. (g) Multi-domain image translation methods, e.g., ComboGAN [14], G2GAN [15] and StarGAN [16]. (h) The proposed GAN model in this paper. Note that the proposed GAN model is a unified solution for the controllable structure guided image-to-image translation problem, i.e., the controllable structure C can be one of class label L, object keypoint K, human skeleton S or semantic map M. 
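"]},{"cell_type":"markdown","metadata":{},"source":["The Fréchet ResNet Distance (FRD) introduced in the abstract follows the same recipe as FID: fit a Gaussian to feature vectors of real and generated images (extracted with a ResNet instead of Inception) and compute the Fréchet distance between the two Gaussians. Below is a minimal NumPy sketch, not the paper's code: for simplicity it assumes diagonal covariances, and random vectors stand in for extracted ResNet features.

```python
import numpy as np

def frechet_distance_diag(feats_real, feats_fake):
    # Fréchet distance between Gaussians fitted to the two feature sets.
    # With diagonal covariances it reduces to
    #   d^2 = ||mu1 - mu2||^2 + sum((sigma1 - sigma2)^2).
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1, s2 = feats_real.std(axis=0), feats_fake.std(axis=0)
    return float(((mu1 - mu2) ** 2).sum() + ((s1 - s2) ** 2).sum())

# Random vectors stand in for pretrained-ResNet features of real and
# generated images; a lower FRD means closer feature statistics.
rng = np.random.default_rng(0)
feats_real = rng.normal(size=(256, 512))
feats_fake = rng.normal(loc=0.1, size=(256, 512))
frd = frechet_distance_diag(feats_real, feats_fake)
```"]},{"cell_type":"markdown","metadata":{},"source":["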
Notations: x and y are the real images; x′ and y′ are the generated images; x″ and y″ are the reconstructed images; K_y is the keypoint of y; S_y is the skeleton of y; M_y is the semantic map of y; L_x and L_y are the class labels of x and y, respectively; C_x and C_y are the controllable structures of x and y, respectively; G, G_{X→Y} and G_{Y→X} represent generators; D, D_Y and D_X denote discriminators. (Introduction)"]},{"cell_type":"markdown","metadata":{},"source":["![title](https://a2c.fyi/q/6faL6RrK08/f1.png)"]},{"cell_type":"markdown","metadata":{},"source":["Fig. 2: Pipeline of the proposed unified GAN model for controllable image-to-image translation tasks. The proposed GAN framework consists of a single generator G and an associated adversarial discriminator D, which takes a conditional image x and a controllable structure C_y as input to produce the target image y′. The framework contains two cycles; only one is shown here. Note that the controllable structure C_y can be a class label, object keypoints, a human skeleton, a semantic map, etc. (Related Work)"]},{"cell_type":"markdown","metadata":{},"source":["$$\n\\begin{equation}\\small\n\\begin{aligned}\nx''= G(y', C_x) = G(G(x, C_y), C_x) \\approx x.\n\\end{aligned}\n\\end{equation}$$"]},{"cell_type":"code","execution_count":0,"outputs":[],"metadata":{},"source":["# Backward cycle: x'' = G(y', C_x) = G(G(x, C_y), C_x) ≈ x\n","x_rec = G(G(x, C_y), C_x)"]},{"cell_type":"markdown","metadata":{},"source":["$$\n\\begin{equation}\\small\n\\begin{aligned}\ny''= G(x', C_y) = G(G(y, C_x), C_y) \\approx y.\n\\end{aligned}\n\\end{equation}$$"]},{"cell_type":"code","execution_count":0,"outputs":[],"metadata":{},"source":["# Forward cycle: y'' = G(x', C_y) = G(G(y, C_x), C_y) ≈ y\n","y_rec = G(G(y, C_x), C_y)"]},{"cell_type":"markdown","metadata":{},"source":["![title](https://a2c.fyi/q/6faL6RrK08/f2.png)"]},{"cell_type":"markdown","metadata":{},"source":["Fig. 3: Illustration of the “channel pollution” issue. 
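"]},{"cell_type":"markdown","metadata":{},"source":["The color loss behind Fig. 3 computes an L1 loss on each RGB channel separately and sums the per-channel terms, so that during backpropagation the error of one channel does not mix with, i.e. “pollute”, the gradients of the other channels. A minimal NumPy sketch of the per-channel idea (the function name and tensor shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def color_loss(generated, target):
    # L1 loss computed channel by channel (R, G, B) and then summed,
    # so each channel contributes its own gradient term.
    return sum(float(np.abs(generated[c] - target[c]).mean()) for c in range(3))

# Toy example on 3x64x64 images: every channel differs by exactly 1,
# so each channel contributes a mean absolute error of 1.
gen = np.zeros((3, 64, 64))
tgt = np.ones((3, 64, 64))
loss = color_loss(gen, tgt)  # → 3.0
```"]},{"cell_type":"markdown","metadata":{},"source":["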
From left to right: input image, ground truth, PG2 [4] and ours. (Optimization Objective)"]},{"cell_type":"markdown","metadata":{},"source":["$$\n\\begin{equation}\n\\begin{aligned}\n\\mathcal{L}_{cyc}(G, C_x, C_y) = & \\mathbb{E}_{x, C_x, C_y} \\left[ \\left|\\left| x - G(G(x, C_y), C_x) \\right|\\right|_1 \\right] \\\\\n+ & \\mathbb{E}_{y, C_x, C_y} \\left[ \\left|\\left| y - G(G(y, C_x), C_y) \\right|\\right|_1 \\right], \n\\end{aligned}\n\\label{eqn:con}\n\\end{equation}$$"]},{"cell_type":"code","execution_count":0,"outputs":[],"metadata":{},"source":["# Monte-Carlo estimate of an expectation over sampled inputs\n","def expectation(f, N=10000):\n","    return mean(f() for _ in range(N))\n","\n","# Cycle-consistency loss: L1 reconstruction error over both cycles\n","L_cyc = (expectation(lambda: l1_norm(x - G(G(x, C_y), C_x)))\n","         + expectation(lambda: l1_norm(y - G(G(y, C_x), C_y))))"]},{"cell_type":"markdown","metadata":{},"source":["$$\n\\begin{equation}\\small\n\\begin{aligned}\n& \\mathcal{L}_{cGAN}(G, D) = \\mathbb{E}_{x, y} \\left[ \\log D(x, y) \\right] + \n\\mathbb{E}_{x} \\left[\\log (1 - D(x, G(x))) \\right],\n\\end{aligned}\n\\label{eqn:conditonalgan}\n\\end{equation}$$"]},{"cell_type":"code","execution_count":0,"outputs":[],"metadata":{},"source":["# Conditional GAN loss: real pairs (x, y) vs. generated pairs (x, G(x))\n","L_cGAN = (expectation(lambda: log(D(x, y)))\n","          + expectation(lambda: log(1 - D(x, G(x)))))"]},{"cell_type":"markdown","metadata":{},"source":["$$\n\\begin{equation}\\small\n\\begin{aligned}\n\\mathcal{L}_{adv}(G, D, C_y) = & \\mathbb{E}_{[x,C_y], y} \\left[\\log D([x, C_y], y) \\right] \\\\\n+ & \\mathbb{E}_{[x,C_y]} \\left[\\log (1 - D([x, C_y], G(x, C_y))) \\right],\n\\end{aligned}\n\\label{eqn:keypointgan}\n\\end{equation}$$"]},{"cell_type":"code","execution_count":0,"outputs":[],"metadata":{},"source":["# Adversarial loss: D sees the image paired with the controllable structure C_y\n","L_adv = expectation(lambda: log(D((x, C_y), y))) + expectation(lambda: log(1 - D((x, C_y), G(x, 
C_y))))"]},{"cell_type":"markdown","metadata":{},"source":["![title](https://a2c.fyi/q/6faL6RrK08/f3.png)"]},{"cell_type":"markdown","metadata":{},"source":["Fig. 4: Different methods for hand gesture-to-gesture translation task on NTU Hand Digit dataset. (Implementation Details)"]},{"cell_type":"markdown","metadata":{},"source":["![title](https://a2c.fyi/q/6faL6RrK08/f4.png)"]},{"cell_type":"markdown","metadata":{},"source":["Fig. 5: Different methods for hand gesture-to-gesture translation task on Senz3D dataset. (Experimental Setup)"]},{"cell_type":"markdown","metadata":{},"source":["![title](https://a2c.fyi/q/6faL6RrK08/f5.png)"]},{"cell_type":"markdown","metadata":{},"source":["Fig. 6: Different methods for cross-view image translation task in 256×256 resolution on Dayton dataset. (Experimental Setup)"]},{"cell_type":"markdown","metadata":{},"source":["![title](https://a2c.fyi/q/6faL6RrK08/f6.png)"]},{"cell_type":"markdown","metadata":{},"source":["Fig. 7: Different methods for cross-view image translation task in 256×256 resolution on CVUSA dataset. (Ablation Study)"]},{"cell_type":"markdown","metadata":{},"source":["![title](https://a2c.fyi/q/6faL6RrK08/f7.png)"]},{"cell_type":"markdown","metadata":{},"source":["Fig. 8: Arbitrary hand gesture-to-gesture translation of our model. (Ablation Study)"]},{"cell_type":"markdown","metadata":{},"source":["![title](https://a2c.fyi/q/6faL6RrK08/f8.png)"]},{"cell_type":"markdown","metadata":{},"source":["Fig. 9: Arbitrary cross-view image translation of our model. (Cross-View Image Translation)"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"}},"nbformat":4,"nbformat_minor":2}