[1910.05929] Emergent properties of the local geometry of neural loss landscapes
We have shown that four nonintuitive, surprising, and seemingly unrelated properties of the local geometry of the neural loss landscape can all arise naturally from an exceedingly simple random model of Hessians and gradients, and of how they vary over both training time and weight scale.
Abstract: The local geometry of high dimensional neural network loss landscapes can both challenge our cherished theoretical intuitions and dramatically impact the practical success of neural network training. Indeed, recent works have observed four striking local properties of neural loss landscapes on classification tasks: (1) the landscape exhibits exactly C directions of high positive curvature, where C is the number of classes; (2) gradient directions are largely confined to this extremely low dimensional subspace of positive Hessian curvature, leaving the vast majority of directions in weight space unexplored; (3) gradient descent transiently explores intermediate regions of higher positive curvature before eventually finding flatter minima; (4) training can be successful even when confined to low dimensional random affine hyperplanes, as long as these hyperplanes intersect a Goldilocks zone of higher than average curvature. We develop a simple theoretical model of gradients and Hessians, justified by numerical experiments on architectures and datasets used in practice, that simultaneously accounts for all four of these surprising and seemingly unrelated properties. Our unified model provides conceptual insights into the emergence of these properties and makes connections with diverse topics in neural networks, random matrix theory, and spin glasses, including the neural tangent kernel, BBP phase transitions, and Derrida's random energy model.
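To make the model concrete, here is a minimal numpy sketch (not the paper's code) of the kind of random gradient/Hessian ensemble described above: per-example logit gradients cluster around nearly orthogonal class means, random logits of scale σ_z set the softmax probabilities, and the Hessian is assembled in Gauss-Newton form for softmax cross-entropy. All sizes and scales (D, C, N, σ_c, σ_E, σ_z) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, N = 500, 10, 300                        # weight dim, classes, datapoints (assumed sizes)
sigma_c, sigma_eps, sigma_z = 1.0, 0.3, 2.0   # cluster scale, within-cluster spread, logit scale

# Mean logit-gradient directions c_k: random, hence nearly orthogonal in high dimension D
c = rng.normal(0.0, sigma_c / np.sqrt(D), size=(C, D))

# Per-example logit gradients cluster around c_k (shape: N x C x D)
J = c[None] + rng.normal(0.0, sigma_eps / np.sqrt(D), size=(N, C, D))

# Random logits of scale sigma_z -> softmax probabilities p
z = rng.normal(0.0, sigma_z, size=(N, C))
p = np.exp(z - z.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)

# Gauss-Newton Hessian of softmax cross-entropy:
#   H = (1/N) sum_mu J_mu^T (diag(p_mu) - p_mu p_mu^T) J_mu
A = np.einsum('nk,kl->nkl', p, np.eye(C)) - np.einsum('nk,nl->nkl', p, p)
M = np.einsum('nkl,nld->nkd', A, J)
H = J.reshape(N * C, D).T @ M.reshape(N * C, D) / N

# Loss gradient with random one-hot labels y: g = (1/N) sum_mu J_mu^T (p_mu - y_mu)
y = np.eye(C)[rng.integers(0, C, size=N)]
g = np.einsum('nk,nkd->d', p - y, J) / N

evals, evecs = np.linalg.eigh(H)                              # ascending eigenvalues
print("largest eigenvalues:", np.round(evals[-(C + 2):], 4))  # expect ~C-1 outliers above the bulk
top = evecs[:, -C:]                                           # top-C eigenspace
print("fraction of gradient power in top-C eigenspace:",
      float(((top.T @ g) ** 2).sum() / (g @ g)))
```

Running this should show a handful of eigenvalues standing well above a near-zero bulk, with most of the squared gradient norm lying in their eigenspace, i.e. qualitative versions of properties (1) and (2).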
Figure 1: Experimental results on the clustering of logit gradients for different datasets, architectures, non-linearities, and stages of training. The green bars correspond to q_SLSC in Eq. ??, the red bars to q_SL in Eq. ??, and the blue bars to q_DL in Eq. ??. In general, the gradients with respect to the weights of logit k cluster well regardless of the class of the datapoint µ at which they were evaluated. For datapoints of true class k they cluster slightly better, while gradients of two different logits k ≠ l are nearly orthogonal. This is visualized in Fig. 2. (Weakness of logit curvature)

Figure 2: A diagram of logit gradient clustering. The k-th logit gradients cluster based on k, regardless of the input datapoint µ. The gradients coming from examples µ of class k cluster more tightly, while gradients of different logits k and l are nearly orthogonal. (Clustering of logit gradients)

Figure 3: The motion of probabilities in the probability simplex (a) during training in a real network, and (b) as a function of the logit variance σ_z² in our random model. (a) The distribution of softmax probabilities in the probability simplex for a 3-class subset of CIFAR-10 during an early, middle, and late stage of training a SmallCNN. (b) The motion of probabilities induced by increasing the logit variance σ_z² (blue to red) in our random model and the corresponding decrease in the entropy of the resulting distributions. (Freezing of class probabilities both over training time and weight scale)

Figure 4: The evolution of the logit variance, logit gradient length, and weight-space radius with training time. The top left panel shows that the logit variance across classes, averaged over examples, grows with training time. The top right panel shows that logit gradient lengths grow with training time. The bottom left panel shows that the weight norm grows with training time. All three experiments were conducted with a SmallCNN on CIFAR-10. The bottom right panel shows that the logit variance grows as one moves radially outward in weight space at random initialization, with no training involved, again in a SmallCNN. (Deriving loss landscape properties)
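The freezing of class probabilities in Fig. 3(b) follows from the softmax alone: as the scale of i.i.d. Gaussian logits grows, the resulting probability vectors migrate from the centre of the simplex toward its corners and their entropy drops. A small sketch (the class count echoes the 3-class subset of Fig. 3; the sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
C, N = 3, 5000   # 3 classes, as in the 3-class CIFAR-10 subset of Fig. 3; N is arbitrary

def mean_softmax_entropy(sigma_z):
    """Average entropy of softmax(z) when the logits z are i.i.d. N(0, sigma_z^2)."""
    z = rng.normal(0.0, sigma_z, size=(N, C))
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return float(-(p * np.log(np.clip(p, 1e-300, None))).sum(axis=1).mean())

for s in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
    print(f"sigma_z = {s:5.1f}   mean entropy = {mean_softmax_entropy(s):.3f}")
# The mean entropy starts near log(3) ~ 1.10 for small sigma_z and decays toward 0
# as sigma_z grows: the probabilities "freeze" at the corners of the simplex.
```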
Figure 5: The Hessian eigenspectrum in our random model. Due to logit clustering it exhibits a bulk plus C − 1 outliers. To obtain C outliers, we can use mean logit gradients c_k whose lengths vary with k (data not shown). (Deriving loss landscape properties)

Figure 6: The overlap between Hessian eigenvectors and gradients in our random model. Blue dots denote the cosine angles between the gradient and the sorted eigenvectors of the Hessian. The majority (71% in this particular case) of the total gradient power lies in the top 10 eigenvectors (out of D = 1000) of the Hessian. (Deriving loss landscape properties)

Figure 7: The top eigenvalue of the Hessian in our random model as a function of the logit standard deviation σ_z (∝ training time, as demonstrated in Fig. 4). We also model logit gradient growth over training by monotonically increasing σ_c while keeping σ_c/σ_E constant. (Deriving loss landscape properties)

Figure 8: Trace(H)/||H|| as a function of the logit standard deviation σ_z (∝ training time, as shown in Fig. 4). This transition is equivalent to what was seen for CNNs in [17]. (Deriving loss landscape properties)
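A rough numerical analogue of the sweeps behind Figs. 7 and 8 can be obtained by rebuilding the random-model Hessian from the earlier sketch at several logit scales σ_z and recording its top eigenvalue and trace. Unlike Fig. 7, the sketch below keeps σ_c fixed rather than growing it with training, and it takes ||H|| to be the spectral norm; these choices, like the sizes, are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, C, N = 500, 10, 300            # assumed sizes, as in the earlier sketch
sigma_c, sigma_eps = 1.0, 0.3     # kept fixed across the sweep (a simplification)

def random_model_hessian(sigma_z):
    """Gauss-Newton Hessian of the clustered-logit-gradient model at logit scale sigma_z."""
    c = rng.normal(0.0, sigma_c / np.sqrt(D), size=(C, D))
    J = c[None] + rng.normal(0.0, sigma_eps / np.sqrt(D), size=(N, C, D))
    z = rng.normal(0.0, sigma_z, size=(N, C))
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    A = np.einsum('nk,kl->nkl', p, np.eye(C)) - np.einsum('nk,nl->nkl', p, p)
    M = np.einsum('nkl,nld->nkd', A, J)
    return J.reshape(N * C, D).T @ M.reshape(N * C, D) / N

for sigma_z in [0.1, 0.5, 1.0, 2.0, 4.0, 8.0]:
    evals = np.linalg.eigvalsh(random_model_hessian(sigma_z))
    lam_max = evals[-1]
    # Tr(H)/||H||, with ||H|| taken as the spectral norm lambda_max (an assumption)
    print(f"sigma_z = {sigma_z:4.1f}   lambda_max = {lam_max:.4f}"
          f"   Tr(H)/lambda_max = {evals.sum() / lam_max:.2f}")
```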