[2001.01678v1] How neural networks find generalizable solutions: Self-tuned annealing in deep learning

\begin{abstract}
Despite the tremendous success of the Stochastic Gradient Descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions in the high-dimensional weight space. By analyzing the learning dynamics and the loss function landscape, we discover a robust inverse relation between the weight variance and the landscape flatness (inverse of curvature) for all SGD-based learning algorithms. To explain this inverse variance-flatness relation, we develop a random landscape theory, which shows that the SGD noise strength (effective temperature) depends inversely on the landscape flatness. Our study indicates that SGD attains a self-tuned, landscape-dependent annealing strategy to find generalizable solutions at the flat minima of the landscape. Finally, we demonstrate how these new theoretical insights lead to more efficient algorithms, e.g., for avoiding catastrophic forgetting.
\end{abstract}
FIG. 1: The PCA results and the drift-diffusion motion in SGD. (A) The rank-ordered variance $\sigma_i^2$ in different principal component (PC) directions $i$. For $i \ge 20$, $\sigma_i^2$ decreases with $i$ as a power law $i^{-\gamma}$ with $\gamma \sim 2$--$3$. (B) The normalized cumulative variance of the top $(n-1)$ PCA components excluding $i = 1$. It reaches $\sim 90\%$ at $n = 35$, much smaller than the total number of weights $N = 2500$ between the two hidden layers. (C) The SGD weight trajectory projected onto the $(\theta_1, \theta_2)$ plane. The persistent drift motion in $\theta_1$ and the diffusive random motion in $\theta_2$ are clearly visible. (D) The diffusive motion in the $(\theta_i, \theta_j)$ plane with $j > i$ ($i \neq 1$) randomly chosen ($i = 49$ and $j = 50$ shown here). Unless otherwise stated, the hyperparameters used are $B = 50$ and $\alpha = 0.1$. (Learning via a low-dimensional drift-diffusion dynamics in SGD)

FIG. 2: The loss function landscape and the inverse variance-flatness relation. (A) The loss function profile $L_i$ along the $i$-th PCA direction. (B) The $\ln(L_i)$ profile. It can be fitted by a quadratic function (red dotted line). The definition of the flatness $F_i$ is also shown. (C) The flatness $F_i$ for different PCA directions $i$, and (D) the inverse relation between the variance $\sigma_i^2$ and the flatness $F_i$ for different choices of minibatch size $B$ and learning rate $\alpha$. (The loss function landscape and the inverse variance-flatness relation)
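The PCA analysis described in Fig. 1 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the recorded SGD weight trajectory (here replaced by hypothetical synthetic data) is centered, its covariance matrix is diagonalized, and the eigenvalues give the rank-ordered variances $\sigma_i^2$ along the PC directions, from which the cumulative variance fraction of Fig. 1B follows.

```python
# Sketch (not the paper's code): PCA of an SGD weight trajectory, as in Fig. 1.
# The trajectory below is a hypothetical stand-in for T recorded snapshots of
# N weights; in practice it would come from logging weights during training.
import numpy as np

rng = np.random.default_rng(0)
T, N = 500, 100
trajectory = rng.normal(size=(T, N)) * np.linspace(1.0, 0.01, N)

# Center the trajectory and diagonalize its covariance matrix.
centered = trajectory - trajectory.mean(axis=0)
cov = centered.T @ centered / (T - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

# Rank-ordered PCA variances sigma_i^2, largest first (Fig. 1A).
sigma2 = eigvals[::-1]

# Cumulative variance fraction of the top components (Fig. 1B).
cum_frac = np.cumsum(sigma2) / sigma2.sum()
```

Projecting `centered` onto the leading eigenvectors in `eigvecs` would give the $(\theta_1, \theta_2)$ trajectories shown in panels (C) and (D).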

FIG. 3: Statistical properties of the MLF ensemble. (A) Profiles of the overall loss function $\ln(L_i)$ (red line) and a set of randomly chosen MLFs $\ln(L_i^{\mu})$ (blue dashed lines) along a given PCA direction $i$. (B) The variance $\tilde{\sigma}_i^2$ of the minimum positions and the average diagonal element $M_{ii}^{(0)}$ of the Hessian matrices of the MLF ensemble versus the flatness $F_i$ of the overall loss function. The combination $(M_{ii}^{(0)})^2 \tilde{\sigma}_i^2$ versus $F_i$ is also shown. $i = 30$ is used here. (The random landscape theory and origin of the inverse variance-flatness relation)

FIG. 5: Profiles and dynamics of the anisotropic active temperature. (A) The active temperature profile $T_i(\delta\theta, t)$ in the $i$-th PCA direction at $t = 200$. (B) The minimum active temperature $T_i(0)$ in different PCA directions $i$. The inverse dependence of $T_i$ on the flatness $F_i$ is shown in the inset. (C) The active temperature profiles $T_i(\delta\theta, t)$ at different times for $i = 10$. (D) The active temperature decreases with time in sync with the loss function dynamics (red line). The shaded region highlights the transition between the fast learning phase and the exploration phase. The correlation between $T_i$ and $L$ is shown in the inset. (SGD as a self-tuned (landscape-dependent) annealing strategy for learning)
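The MLF-ensemble statistics of Fig. 3B can be illustrated with a minimal sketch. This is not the paper's implementation: each minibatch $\mu$ is represented by a hypothetical 1D quadratic loss along one PCA direction, with its own minimum position and curvature, and the ensemble quantities $\tilde{\sigma}_i^2$, $M_{ii}^{(0)}$, and their combination $(M_{ii}^{(0)})^2 \tilde{\sigma}_i^2$ are computed from those parameters.

```python
# Sketch (illustrative, not the paper's code): statistics of a minibatch-loss-
# function (MLF) ensemble along one PCA direction, in the spirit of Fig. 3.
import numpy as np

rng = np.random.default_rng(1)
n_batches = 200

# Hypothetical MLF parameters: each minibatch mu contributes a 1D quadratic
# loss L_i^mu with its own minimum position and curvature (Hessian diagonal).
minima = rng.normal(loc=0.0, scale=0.3, size=n_batches)
curvatures = rng.uniform(1.0, 2.0, size=n_batches)

# Ensemble statistics reported in Fig. 3B.
sigma_tilde2 = minima.var()   # variance of the MLF minimum positions
M0_ii = curvatures.mean()     # average diagonal Hessian element M_ii^(0)

# Combination (M_ii^(0))^2 * sigma_tilde_i^2 shown versus F_i in the paper.
combo = (M0_ii ** 2) * sigma_tilde2
```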

FIG. 4: The landscape-dependent constraints for avoiding catastrophic forgetting. (A) The test errors for task-1 and task-2 versus training time for task-2 in the absence of the constraints ($\lambda = 0$). (B) The weight displacements $q_i$ in different PCA directions $\vec{p}_i^{\,1}$ from task-1 in the absence of the constraints ($\lambda = 0$). The threshold $\xi \equiv 0.008\,F_i^1$ is shown by the red dotted line. (C)&(D) are the same as (A)&(B) but in the presence of the constraints with $\lambda = 10$. (E) The tradeoff between the saturated test errors of the two tasks when varying $\lambda$ for the LDC (blue circles) and EWC (red squares) algorithms. (F) The overall performance (sum of the two test errors) versus the number of constraints $N_c$ for the LDC (blue circles) and EWC (red squares) algorithms. The two tasks are classifying two separate digit pairs [(0, 1) for task-1 and (2, 3) for task-2] in MNIST. (Preventing catastrophic forgetting by using landscape-dependent constraints)
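A landscape-dependent constraint of the kind used in Fig. 4 can be sketched as a penalty term added to the task-2 loss. This is a hedged illustration, not the paper's exact algorithm: the function name `ldc_penalty` and its quadratic form are assumptions; the idea it encodes is that displacements $q_i$ along task-1 PCA directions $\vec{p}_i^{\,1}$ are penalized more strongly where the task-1 flatness $F_i$ is small (stiff directions).

```python
# Sketch (assumptions flagged in the lead-in): a landscape-dependent
# constraint (LDC) penalty in the spirit of Fig. 4.
import numpy as np

def ldc_penalty(w, w1, directions, flatness, lam=10.0):
    """Quadratic penalty on displacements along task-1 PCA directions.

    w, w1      : current weights and the task-1 solution (1D arrays)
    directions : (Nc, N) array of task-1 PCA directions p_i
    flatness   : (Nc,) flatness F_i of the task-1 landscape along each p_i
    lam        : overall constraint strength lambda
    """
    q = directions @ (w - w1)                  # displacements q_i along p_i
    return lam * np.sum((q / flatness) ** 2)   # stiffer where F_i is small
```

During task-2 training, this penalty would be added to the task-2 loss, so that weight updates are free to move along flat (large-$F_i$) directions of the task-1 landscape but are held back along stiff ones.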