[1911.06515v1] Likelihood Assignment for Out-of-Distribution Inputs in Deep Generative Models is Sensitive to Prior Distribution Choice

Abstract
Recent work has shown that deep generative models assign higher likelihood to out-of-distribution inputs than to training data. We show that a factor underlying this phenomenon is a mismatch between the nature of the prior distribution and that of the data distribution, a problem found in widely used deep generative models such as VAEs and Glow. While a typical choice for a prior distribution is a standard Gaussian distribution, properties of distributions of real data sets may not be consistent with a unimodal prior distribution. This paper focuses on the relationship between the choice of a prior distribution and the likelihoods assigned to out-of-distribution inputs. We propose the use of a mixture distribution as a prior to make the likelihoods assigned by deep generative models sensitive to out-of-distribution inputs. Furthermore, we explain the theoretical advantages of adopting a mixture distribution as the prior, and we present experimental results to support our claims. Finally, we demonstrate that a mixture prior lowers the out-of-distribution likelihood on two pairs of real image data sets: Fashion-MNIST vs. MNIST and CIFAR-10 vs. SVHN.
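The intuition behind the proposal can be seen in one dimension: a standard Gaussian prior places its highest density at the origin, whereas an equal-weight bimodal Gaussian mixture with well-separated modes assigns very low density to the region between them, where out-of-distribution latents may land. A minimal sketch of this comparison (an illustration, not code from the paper):

```python
import math

def log_std_normal(z):
    """Log-density of N(0, 1)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def log_bimodal(z, mu):
    """Log-density of an equal-weight mixture of N(-mu, 1) and N(+mu, 1),
    computed with log-sum-exp for numerical stability."""
    a = log_std_normal(z - mu)   # component centered at +mu
    b = log_std_normal(z + mu)   # component centered at -mu
    m = max(a, b)
    return m + math.log(0.5 * (math.exp(a - m) + math.exp(b - m)))

# The midpoint between the two modes gets far lower density under the
# bimodal prior than the origin does under the standard Gaussian prior.
print(log_std_normal(0.0))     # about -0.92
print(log_bimodal(0.0, 5.0))   # about -13.4
```

This is also why the distance between the two components matters in the experiments below: as mu grows, the density the mixture assigns to the region between the modes shrinks rapidly.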
Figure 1: Motivation for using a multimodal prior distribution from a topological point of view. If the prior distribution is mapped to a distribution with a different topology, the mapped distribution will inevitably have undesirable high-likelihood areas. The black and red areas represent the typical sets of the prior and the data distribution, respectively. The gray and yellow areas represent high-likelihood areas of the prior and the data distribution, respectively. While the distributions are shown in two dimensions in this figure, this inconsistency between high-likelihood areas and typical sets is a problem observed in high-dimensional data. (Topology Mismatch)
Figure 2: Standard Gaussian prior. Figure 3: Bimodal Gaussian prior. Figure 4: Data points. Figure 5: Visualization of the topology mismatch problem on two-dimensional Gaussian mixture data. (??, ??) Contours of the log-likelihoods assigned by flow-based generative models using a standard Gaussian prior and a bimodal Gaussian mixture prior. The 10 contour lines in the images range from -10 to -1. The model with a standard Gaussian prior assigns high likelihoods outside the high-probability areas of the true distribution. (??) Training data (blue) and out-of-distribution inputs (orange) used in this experiment. (Topology Mismatch)
Figure 6: Probability density functions of a standard Gaussian distribution (blue) and a generalized Gaussian distribution with parameters α = √(Γ(1/β)/Γ(3/β)), β = 4 (orange). (Proposed Model)
Figure 7: VAE. Figure 8: Glow. Figure 9: Histograms of the log-likelihoods assigned by VAEs and Glow trained on Fashion-MNIST (labels 1 and 7). "uni" denotes a standard Gaussian prior and "multi" denotes a bimodal Gaussian mixture prior. For Fashion-MNIST, we report likelihoods evaluated on test data. Bimodal priors mitigate the out-of-distribution problem.
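The scale parameter in Figure 6, α = √(Γ(1/β)/Γ(3/β)), is the choice that gives a zero-mean generalized Gaussian unit variance, making it directly comparable with the standard Gaussian. A quick numerical check of that normalization (an illustrative sketch, not code from the paper):

```python
import math

def gen_gaussian_pdf(x, beta):
    """Density of a zero-mean generalized Gaussian with scale
    alpha = sqrt(Gamma(1/beta) / Gamma(3/beta)), which yields unit variance."""
    alpha = math.sqrt(math.gamma(1.0 / beta) / math.gamma(3.0 / beta))
    coef = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return coef * math.exp(-((abs(x) / alpha) ** beta))

# Riemann-sum check that the density integrates to 1 and has variance 1
# for beta = 4, the value used in Figure 6.
dx = 1e-3
xs = [i * dx for i in range(-10000, 10001)]
mass = sum(gen_gaussian_pdf(x, 4.0) for x in xs) * dx   # close to 1.0
var = sum(x * x * gen_gaussian_pdf(x, 4.0) for x in xs) * dx  # close to 1.0
```

With β = 4 the density is flatter near the mode and has lighter tails than the Gaussian (β = 2), which is the shape contrast Figure 6 depicts.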
(Two Labels and Two Modes)
Figure 10: Bimodal, labels 1 and 7. Figure 11: Unimodal, label 7. Figure 12: Histograms of the first dimension of the latent variables of Glow trained on labels 1 and 7, and of the 613th dimension of Glow trained only on label 7, of Fashion-MNIST. For the unimodal prior model, we select the dimension of the latent variable with the largest absolute mean for MNIST. Further results are reported in the Appendix. (Two Labels and Two Modes)
Figure 13: Relationship between the distance between two components and the mean log-likelihoods assigned to MNIST by models trained on Fashion-MNIST (labels 1 and 7). While the likelihoods assigned to out-of-distribution inputs are sensitive to the distance between components regardless of component choice, the Gaussian mixture priors require much larger distances to lower the likelihoods assigned to out-of-distribution inputs. The histograms and the mean values of the log-likelihoods are reported in the Appendix. (Two Labels and Two Modes)
Figure 14: Per-dimensional empirical mean of the squared distance between the mean image of each component of the VAE trained on Fashion-MNIST (labels 1 and 7) and images in Fashion-MNIST (labels 1 and 7). Figure 15: Per-dimensional empirical mean of the squared distance between the mean image of each component of the VAE trained on Fashion-MNIST (labels 1 and 7) and images in MNIST. Figure 16: Per-dimensional variance of images in Fashion-MNIST (labels 1 and 7) and MNIST. Figure 17: Comparison of the experimental results with the suggestion based on the analysis in Section ??. (??), (??) The per-dimensional empirical mean of the squared distance from the mean image of each component of the VAE with a bimodal prior distribution. The model is trained on Fashion-MNIST (labels 1 and 7), and test images in Fashion-MNIST (labels 1 and 7) and MNIST are assumed to be allocated to the nearest component in the latent variable space. "fashion i" and "mnist i" denote the data allocated to the i-th component. (??)
The per-dimensional variances over pixels of images in MNIST and Fashion-MNIST. The y-axis is clipped for visualization. (Two Labels and Two Modes)
Figure 18: VAE. Figure 19: Glow. Figure 20: Likelihoods assigned by the models trained on Fashion-MNIST (labels 0, 1, 7, 8). The models with unimodal priors assign higher likelihoods to MNIST, whereas those using multimodal priors mitigate this problem. (Results on Complex Data Sets)
Figure 21: VAE. Figure 22: Glow. Figure 23: Likelihoods assigned by models trained on CIFAR-10 (labels 0 and 7). The models using standard Gaussian priors assign higher likelihoods to SVHN. The models using multimodal priors mitigate this problem, although the effect is limited for Glow. (Results on Complex Data Sets)
Figure 28: Images corresponding to the means of the components of the bimodal prior distributions of VAE and Glow trained on labels 0 and 7 of Fashion-MNIST (left). Images in the data set allocated to each cluster (right). Figure 29: VAE with a unimodal prior. Figure 30: Glow with a unimodal prior. Figure 31: Images corresponding to the means of the unimodal distributions of VAE and Glow trained on labels 0 and 7 of Fashion-MNIST (left), and images generated by random sampling (right). The mean image of the VAE is dissimilar to the training data, while images from random sampling are similar to the training data. While the mean image of Glow is similar to the training data, some images from random sampling of Glow are dissimilar to the training data. (Mean Images of Clusters)
Figure 24: Two-dimensional data. Figure 25: 10-dimensional data, standard Gaussian prior. Figure 26: 10-dimensional data, multimodal prior. Figure 27: (??) Histograms of the log-likelihoods assigned to training and out-of-distribution data by flow-based generative models trained on the simple two-dimensional Gaussian mixture data of Section ??. "uni" denotes a unimodal prior, and "multi" denotes a multimodal prior.
While a model with a unimodal prior assigns relatively high likelihoods to out-of-distribution inputs, a model with a multimodal prior assigns much lower likelihoods to them. (??, ??) Histograms of the log-likelihoods assigned by flow-based generative models for 10-dimensional data. The out-of-distribution problem is more serious for high-dimensional data. (Simple Artificial Data)
Figure 32: Sample images in the four clusters of label 0. Each row corresponds to one cluster. We use the images in the second row. Figure 33: Per-dimensional variance of images in each cluster of label 0. Figure 34: Sample images in the four clusters of label 4. Each row corresponds to one cluster. We use the images in the second row. Figure 35: Per-dimensional variance of images in each cluster of label 4. Figure 36: K-means clustering for CIFAR-10 labels 0 and 4. (K-means Clustering for CIFAR-10)
Figure 37: VAE, Gaussian mixture. Figure 38: VAE, generalized Gaussian mixture. Figure 39: Glow, Gaussian mixture. Figure 40: Glow, generalized Gaussian mixture. Figure 41: Distances between two components and log-likelihoods assigned to MNIST by models trained on Fashion-MNIST (labels 1 and 4). The results in these images correspond to those in Figure ??. (Distance between Two Components)
Figure 42: VAE, Gaussian mixture, Fashion-MNIST (1, 7). Figure 43: VAE, generalized Gaussian mixture, Fashion-MNIST (1, 7). Figure 44: Glow, Gaussian mixture, Fashion-MNIST (1, 7). Figure 45: Glow, generalized Gaussian mixture, Fashion-MNIST (1, 7). Figure 46: Distances between two components and the log-likelihoods assigned to test data (Fashion-MNIST (1, 7)) by VAEs trained on Fashion-MNIST (1, 7). The likelihoods assigned to the test data are relatively unaffected by the distance between the two components.
(Distance between Two Components)
Figure 47: VAE, label 1. Figure 48: VAE, label 7. Figure 49: Glow, label 1. Figure 50: Glow, label 7. Figure 51: Histograms of the log-likelihoods assigned by models with standard Gaussian priors trained on Fashion-MNIST (label 1 or 7). The results correspond to those in Table ??. While the models trained on a simpler data set assign lower likelihoods to out-of-distribution inputs, the models using multimodal distributions assign much lower likelihoods. (Unimodal Prior Models Trained on Simpler Data)
Figure 52: VAE. Figure 53: Glow. Figure 54: Latent variables of the models with bimodal Gaussian priors trained on Fashion-MNIST (labels 1, 7). The latent variables of MNIST reside in out-of-distribution areas. Figure 55: VAE. Figure 56: Glow. Figure 57: Latent variables of the models with bimodal Gaussian priors trained on CIFAR-10 (labels 0, 4). The latent variables of SVHN reside near in-distribution areas. (Histograms of the Latent Variables)
Figure 58: VAE, label 1. Figure 59: VAE, label 7. Figure 60: Glow, label 1. Figure 61: Glow, label 7. Figure 62: Histograms of the latent variables of the models with standard Gaussian priors trained on Fashion-MNIST label 1 or 7. We select the dimension whose mean of the latent variables of MNIST is farthest from zero. (Histograms of the Latent Variables)