[1911.12287v1] Your Local GAN: Designing Two Dimensional Local Attention Mechanisms for Generative Models
This new inversion method allows us to visualize our network on approximations of real images and also to test how well a generative model performs in this important coverage task.
Abstract: We introduce a new local sparse attention layer that preserves two-dimensional geometry and locality. We show that by simply replacing the dense attention layer of SAGAN with our construction, we obtain significant improvements in FID, Inception score, and visual quality. The FID score improves from 18.65 to 15.94 on ImageNet, with all other parameters kept the same. The sparse attention patterns that we propose for our new layer are designed using a novel information-theoretic criterion based on information flow graphs. We also present a novel way to invert Generative Adversarial Networks with attention. Our method uses the attention layer of the discriminator to construct a new loss function. This allows us to visualize the newly introduced attention heads and show that they indeed capture interesting aspects of the two-dimensional geometry of real images.
(a) Attention masks for the Fixed Pattern [6]. (b) Attention masks for the Left-To-Right (LTR) pattern. (c) Attention masks for the Right-To-Left (RTL) pattern. (d) Information Flow Graph associated with the Fixed Pattern. This pattern does not have Full Information, i.e., there are dependencies between nodes that the attention layer cannot model. For example, there is no path from node 0 of V0 to node 1 of V2. (e) Information Flow Graph associated with LTR. This pattern has Full Information, i.e., there is a path between any node of V0 and any node of V2. Note that the number of edges is only increased by a constant compared to the Fixed Attention Pattern [6], illustrated in (d). (f) Information Flow Graph associated with RTL. This pattern also has Full Information. RTL is a "transposed" version of LTR, so that the local context to the right of each node is attended at the first step. Overview: this figure illustrates the different 2-step sparsifications of the attention layer that we examine in this paper. The first row shows the boolean masks that we apply at each of the two steps. The color of cell [i, j] indicates whether node i can attend to node j: dark blue marks positions attended at both steps, light blue the positions of the first mask, green the positions of the second mask, and yellow the positions that are not attended at either step (sparsity). The second row shows the Information Flow Graph associated with these attention masks. An Information Flow Graph visualizes how information "flows" through the attention layer; intuitively, it shows how our model can use the 2-step factorization to find dependencies between image pixels. In each multipartite graph, the nodes of the first vertex set correspond to the image pixels, just before the attention.
An edge from a node of the first vertex set, V0, to a node of the second vertex set, V1, means that the node of V0 can attend to that node of V1 at the first attention step. Edges between V1 and V2 illustrate the second attention step. (Your Local GAN)
Figure 3: Reshape and ESA enumerations of the cells of an image grid, showing how the grid is projected onto a line. (Left) Enumeration of the pixels of an 8 × 8 image using a standard reshape. This projection maintains locality only within rows. (Right) Enumeration of the pixels of an 8 × 8 image using the ESA framework. We use the Manhattan distance from the start (0, 0) as the criterion for enumeration. Although there is some distortion due to the projection to 1-D, locality is mostly maintained. (Two-Dimensional Locality)
Figure 4: Training comparison for YLG-SAGAN and SAGAN. We plot the Inception score (a) and the FID (b) of both YLG-SAGAN and SAGAN every 200k steps, up to 1M training steps on ImageNet. As can be seen, YLG-SAGAN converges much faster than the baseline. Specifically, we obtain our best FID at step 865k, while SAGAN requires over 1.3M steps to reach its FID performance peak. Comparing peak performance for both models, we obtain an improvement from 18.65 to 15.94 FID by only changing the attention layer. (Experimental Validation)
Inversion and saliency maps for different heads of the Generator network. We emphasize that this image of a redshank bird was not in the training set; rather, it was obtained through a Google image search. Saliency is extracted by averaging the attention each pixel of the key image gets from the query image. We use the same trick to enhance inversion. (a) A real image of a redshank. (b) A demonstration of how the standard inversion method [3] fails. (c) The inverted image for this redshank, using our technique. (d) Saliency map for head 7. Attention is mostly applied to the bird body. (e) Saliency map for head 2.
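The Full Information property discussed in the attention-pattern captions above can be checked mechanically: a 2-step factorization has Full Information exactly when the boolean product of its two step masks has no zero entries (every node of V0 reaches every node of V2 through V1). Below is a minimal sketch; the fixed and strided mask definitions are illustrative, non-causal reconstructions, not the exact YLG masks.

```python
import numpy as np

def full_information(mask1, mask2):
    """Full Information check for a 2-step attention factorization:
    every node of V0 must reach every node of V2, i.e. the boolean
    product of the two step masks must be all-ones."""
    reach = (mask1.astype(int) @ mask2.astype(int)) > 0
    return bool(reach.all())

n, stride = 8, 4
block = np.arange(n) // stride
offset = np.arange(n) % stride

# Fixed-pattern sketch: step 1 attends within the local block,
# step 2 attends only to the last position of each block.
fixed1 = (block[:, None] == block[None, :])
fixed2 = np.tile(offset == stride - 1, (n, 1))

# Strided sketch with Full Information: step 1 attends within the
# local block, step 2 attends to the same offset in every block.
strided1 = fixed1.copy()
strided2 = (offset[:, None] == offset[None, :])

print(full_information(fixed1, fixed2))      # False: e.g. node 1 is unreachable
print(full_information(strided1, strided2))  # True
```

The check mirrors the information flow graphs: a path V0 → V1 → V2 exists iff some intermediate node is an allowed key in step 1 and an allowed query target in step 2, which is exactly what the boolean matrix product computes.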
This head attends almost everywhere in the image. (Inversion as lens to attention)
Inverted image of an indigo bird and visualization of the attention maps for specific query points. (a) The original image. Again, this was obtained through a Google image search and was not in the training set. (b) Shows how previous inversion methods fail to reconstruct the head of the bird and the branch. (c) A successful inversion using our method. (d) Shows how attention uses our ESA trick to model homogeneous background areas. (e) Attention applied to the bird. (f) Attention applied with a query on the branch. Notice how attention is non-local and captures the full branch. (Inversion as lens to attention)
Figure 7: Upper panel: YLG conditional image generation on different dog breeds from the ImageNet dataset. From top to bottom: Eskimo husky, Siberian husky, Saint Bernard, Maltese. Lower panel: randomly generated samples from YLG-SAGAN. Additional generated images are included in the Appendix. (Conclusions and Future Work)
Figure 8: (a) Real image of a redshank. (b) Saliency map extracted from all heads of the Discriminator. (c) Saliency map extracted from a single head of the Discriminator. Weighting our loss function with (b) does not have a large impact, as the attention weights are almost uniform. The saliency map from (c) is more likely to help correctly invert the bird. We can use saliency maps from other heads to invert the background as well. (Multiple heads and saliency map)
(a) Attention masks for the Strided Pattern [6]. (b) Attention masks for YLG Strided (Extended Strided with the Full Information property). (c) Information Flow Graph associated with the Strided Pattern. This pattern does not have Full Information, i.e., there are dependencies between nodes that the attention layer cannot model. For example, there is no path from node 2 of V0 to node 1 of V2. (d) Information Flow Graph associated with the YLG Strided pattern.
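The ESA enumeration of Figure 3 orders the grid cells by their Manhattan distance from (0, 0), so that flattening the image to 1-D mostly preserves two-dimensional locality. A minimal sketch follows; the row-index tie-break among cells at equal distance is an assumption, as the paper's exact tie-breaking rule is not restated here.

```python
def esa_enumeration(n):
    # Enumerate the cells of an n x n grid by Manhattan distance from
    # (0, 0). Cells at equal distance are ordered by row index (an
    # illustrative assumption). Returns a dict: (row, col) -> 1-D index.
    cells = sorted(((i, j) for i in range(n) for j in range(n)),
                   key=lambda c: (c[0] + c[1], c[0]))
    return {cell: idx for idx, cell in enumerate(cells)}

enum = esa_enumeration(8)
print(enum[(0, 0)], enum[(0, 1)], enum[(1, 0)])  # 0 1 2
```

Compare with the standard reshape, which maps (i, j) to i * n + j: the reshape keeps row neighbors adjacent but places column neighbors n positions apart, whereas the distance-based ordering spreads the distortion more evenly across both axes.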
This pattern has Full Information, i.e., there is a path between any node of V0 and any node of V2. Note that the number of edges is only increased by a constant compared to the Strided Attention Pattern [6], illustrated in (c). Overview: this figure illustrates the original Strided Pattern [6] and the YLG Strided pattern, which has Full Information. The first row shows the boolean masks that we apply at each of the two steps. The color of cell [i, j] indicates whether node i can attend to node j: dark blue marks positions attended at both steps, light blue the positions of the first mask, green the positions of the second mask, and yellow the positions that are not attended at either step (sparsity). The second row shows the Information Flow Graph associated with these attention masks. An Information Flow Graph visualizes how information "flows" through the attention layer; intuitively, it shows how our model can use the 2-step factorization to find dependencies between image pixels. In each multipartite graph, the nodes of the first vertex set correspond to the image pixels, just before the attention. An edge from a node of the first vertex set, V0, to a node of the second vertex set, V1, means that the node of V0 can attend to that node of V1 at the first attention step. Edges between V1 and V2 illustrate the second attention step. (Strided Pattern)
Figure 10: More inversions using our technique. On the left we present real images; on the right, our inversions using YLG-SAGAN. (More inversion visualizations)
Figure 13: Generated images from YLG-SAGAN, divided by ImageNet category. (Generated images)
(a) Real image. (b) Inversion with our method. (c) Weighted inversion at the Generator. (d) Inversion using the standard method [3]. Inversion of the real image in (a) with different methods. Our method, shown in (b), is the only successful inversion.
The inversion using the weights from the saliency map applied to the output of the Generator, shown in (c), fails badly. The same holds for the inversion using the standard method in the literature [3], as shown in (d). (Weighted inversion at the generator space)
Figure 14: Generated images from YLG-SAGAN, divided by ImageNet category. (Generated images)
Figure 15: Generated images from YLG-SAGAN, divided by ImageNet category. (Generated images)
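The saliency trick used throughout the inversion figures, averaging the attention each key pixel receives over all query positions, and the saliency-weighted reconstruction loss can be sketched as below. The helper names `head_saliency` and `weighted_inversion_loss` are hypothetical, and the paper's actual inversion loss may include additional terms; this is only an illustration of the weighting idea under those assumptions.

```python
import numpy as np

def head_saliency(attn, grid):
    # attn: (n_query, n_key) attention weights from a single
    # discriminator head. Saliency of each key pixel is the average
    # attention it receives over all query positions, reshaped to the
    # 2-D image grid.
    return attn.mean(axis=0).reshape(grid, grid)

def weighted_inversion_loss(x_real, x_gen, saliency):
    # Saliency-weighted reconstruction loss: pixels that the chosen
    # head attends to contribute more to the objective (sketch only).
    w = saliency / saliency.sum()
    return float((w * (x_real - x_gen) ** 2).sum())

# Example with uniform attention over an 8 x 8 (= 64 pixel) grid:
attn = np.full((64, 64), 1.0 / 64)
sal = head_saliency(attn, 8)  # shape (8, 8), uniform weights
loss = weighted_inversion_loss(np.ones((8, 8)), np.zeros((8, 8)), sal)
print(loss)  # 1.0: uniform weights reduce to the mean squared error
```

With uniform saliency the loss reduces to the plain mean squared error, which matches the observation in Figure 8 that weighting by the nearly uniform all-heads map has little effect, while a single focused head reweights the loss toward the object of interest.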