[1912.02757v1] Deep Ensembles: A Loss Landscape Perspective
Accuracy and Brier score are reported on CIFAR-10, both on the usual test set (corresponding to the intensity = 0 column) and on the CIFAR-10-C benchmark (Hendrycks & Dietterich, 2019), which contains corrupted versions of CIFAR-10 at varying intensity levels (1-5), making it useful for verifying calibration under dataset shift (Ovadia et al., 2019).
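The Brier score used in this evaluation is the mean squared difference between the predicted class probabilities and the one-hot encoding of the true label. A minimal NumPy sketch (the function name and the convention of summing over classes before averaging over examples are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def brier_score(probs, labels, num_classes):
    """Mean over examples of the squared distance between the predicted
    probability vector and the one-hot true label (summed over classes)."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(num_classes)[np.asarray(labels)]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

# A confident correct prediction contributes 0; an uncertain one contributes more.
print(brier_score([[1.0, 0.0], [0.5, 0.5]], [0, 1], num_classes=2))  # 0.25
```

Lower is better: a perfectly calibrated, perfectly accurate model scores 0, so the score degrades visibly as corruption intensity increases under dataset shift.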

Deep ensembles have been empirically shown to be a promising approach for improving accuracy, uncertainty and out-of-distribution robustness of deep learning models. While deep ensembles were theoretically motivated by the bootstrap, non-bootstrap ensembles trained with just random initialization also perform well in practice, which suggests that there could be other explanations for why deep ensembles work well. Bayesian neural networks, which learn distributions over the parameters of the network, are theoretically well-motivated by Bayesian principles, but do not perform as well as deep ensembles in practice, particularly under dataset shift. One possible explanation for this gap between theory and practice is that popular scalable approximate Bayesian methods tend to focus on a single mode, whereas deep ensembles tend to explore diverse modes in function space. We investigate this hypothesis by building on recent work on understanding the loss landscape of neural networks and adding our own exploration to measure the similarity of functions in the space of predictions. Our results show that random initializations explore entirely different modes, whereas functions along an optimization trajectory, or sampled from the subspace thereof, cluster within a single mode in prediction space while often deviating significantly in weight space. We demonstrate that while low-loss connectors between modes exist, they are not connected in the space of predictions. Developing the concept of the diversity–accuracy plane, we show that the decorrelation power of random initializations is unmatched by popular subspace sampling methods.
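The two similarity measures the paper uses throughout can be sketched concretely: cosine similarity between flattened weight vectors (weight space) and the fraction of examples on which predicted labels disagree (function space). A minimal NumPy sketch; the function names are illustrative, not from the paper:

```python
import numpy as np

def weight_cosine_similarity(w1, w2):
    """Cosine similarity between two flattened weight vectors."""
    w1, w2 = np.ravel(w1), np.ravel(w2)
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

def prediction_disagreement(labels1, labels2):
    """Fraction of examples on which two models' predicted labels differ."""
    labels1, labels2 = np.asarray(labels1), np.asarray(labels2)
    return float(np.mean(labels1 != labels2))

# Toy example: two weight vectors and two models' predicted labels.
w_a = np.array([1.0, 0.0, 1.0])
w_b = np.array([0.0, 1.0, 1.0])
print(weight_cosine_similarity(w_a, w_b))          # 0.5
print(prediction_disagreement([0, 1, 2, 2], [0, 1, 1, 2]))  # 0.25
```

Two checkpoints can be nearly orthogonal in weight space yet agree on almost every prediction, which is exactly the dissociation between weight-space and function-space similarity that the paper's experiments probe.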
Figure 1: Cartoon illustration of the hypothesis. The x-axis indicates parameter values and the y-axis plots the negative loss −L(θ, {x_n, y_n}_{n=1}^N) on train and validation data. (Introduction)
Figure 2: Results using SimpleCNN on CIFAR-10. Left plot: cosine similarity between checkpoints, measuring weight-space alignment along the optimization trajectory. Middle plot: the fraction of labels on which the predictions from different checkpoints disagree. Right plot: t-SNE plot of predictions from checkpoints corresponding to 3 different randomly initialized trajectories (in different colors). (Visualizing Function Similarity Across Initializations)
Figure 3: Results on CIFAR-10 using two different architectures. For each architecture, the left subplot shows the cosine similarity between different solutions in weight space, and the right subplot shows the fraction of labels on which the predictions from different solutions disagree. (Visualizing Function Similarity Across Initializations)
Figure 4: Results using SimpleCNN on CIFAR-10: t-SNE plots of validation-set predictions for each trajectory along with four different subspace generation methods (shown by squares), in addition to 3 independently initialized and trained runs (different colors). As the plot shows, the subspace-sampled functions stay in the prediction-space neighborhood of the run around which they were constructed, demonstrating that truly different functions are not sampled. (Similarity of Functions Across Subspaces from Each Trajectory)
Figure 5: Diversity-versus-accuracy plots for 3 models trained on CIFAR-10: SmallCNN, MediumCNN and ResNet20v1. The clear separation between the subspace-sampling populations (for 4 different subspace sampling methods) and the population of independently initialized and optimized solutions (red stars) is visible. The 2 limiting curves correspond to solutions generated by perturbing the reference solution's predictions (bottom curve) and completely random predictions at a given accuracy (upper curve). (Similarity of Functions Across Subspaces from Each Trajectory)
Figure 6: Results using MediumCNN on CIFAR-10: radial loss-landscape cut between the origin and two independent optima, and the predictions of models on the same plane. (Identical loss does not imply identical functions in prediction space)
Figure 7: Left: cartoon illustration showing the linear connector (black) along with the optimized connector, which lies on the manifold of low-loss solutions. Right: the loss and accuracy between two independent optima along a linear path and an optimized path in weight space. (Identical loss does not imply identical functions in prediction space)
Figure 8: Results using MediumCNN on CIFAR-10: radial loss-landscape cut between the origin and two independent optima along an optimized low-loss connector, and prediction similarity along the same planes. (Identical loss does not imply identical functions in prediction space)
Figure 9: Results on CIFAR-10 showing the complementary benefits of ensemble and subspace methods, as well as the effect of ensemble size. (Evaluating the Relative Effects of Ensembling versus Subspace Methods)
Figure 10: Results on CIFAR-10 using SimpleCNN: clean test set and CIFAR-10-C corrupted test set. (Weight averaging within a subspace)
Figure 11: Results using ResNet on ImageNet: clean test set and ImageNet-C corrupted test set. (Results on ImageNet and ImageNet-C)
Figure 12: Diversity-versus-accuracy plots for a ResNet20v1 trained on CIFAR-100. (Additional diversity–accuracy results on CIFAR-100)
Figure 13: Loss landscape versus generalization: weights are typically initialized close to 0 and increase radially over the course of training. Top row: we pick two optima from different trajectories as the axes and plot the loss surface. Along the x and y axes, we observe that while a wide range of radii achieve low loss on the training set, the range of optimal radius values is narrower on the validation set. Bottom row: we average weights within each trajectory using WA and use them as axes. A wider range of radius values generalizes better along the WA directions, which confirms the findings of Izmailov et al. (2018). (Visualizing the loss landscape along original directions and WA directions)
Figure 14: The effect of random initializations and random training batches on the diversity of predictions. For runs on a GPU, the same initialization and the same training batches (red) do not lead to exactly the same predictions. On a TPU, such runs always learn the same function and therefore have zero diversity of predictions. (Effect of randomness: random initialization versus random shuffling)
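Figures 6-8 evaluate models at points between two independently trained optima. The simplest such path, the linear connector, interpolates directly in weight space and can be sketched as follows (a minimal NumPy sketch assuming flattened weight vectors and a user-supplied loss function; all names are illustrative):

```python
import numpy as np

def linear_path(theta_a, theta_b, num_points=11):
    """Weights along the straight line theta(t) = (1 - t)*theta_a + t*theta_b."""
    ts = np.linspace(0.0, 1.0, num_points)
    return [(1.0 - t) * theta_a + t * theta_b for t in ts]

def evaluate_path(theta_a, theta_b, loss_fn, num_points=11):
    """Loss at evenly spaced points on the linear connector between two solutions."""
    return [loss_fn(theta) for theta in linear_path(theta_a, theta_b, num_points)]

# Toy quadratic loss: the two endpoints are symmetric around the minimum,
# so here the midpoint of the path happens to have the lowest loss.
toy_loss = lambda theta: float(np.sum(theta ** 2))
losses = evaluate_path(np.array([-1.0, 0.0]), np.array([1.0, 0.0]), toy_loss, num_points=5)
print(losses)  # [1.0, 0.25, 0.0, 0.25, 1.0]
```

For real networks the paper observes the opposite shape on the linear path (a loss barrier between the optima), which is why the optimized low-loss connector of Figure 7 is needed; and even along such a low-loss connector, the interpolated models are not similar in prediction space.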