[1910.13962v1] Understanding the Role of Momentum in Stochastic Gradient Methods
Abstract: The use of momentum in stochastic gradient methods has become a widespread practice in machine learning. Different variants of momentum, including heavy-ball momentum, Nesterov's accelerated gradient (NAG), and quasi-hyperbolic momentum (QHM), have demonstrated success on various tasks. Despite these empirical successes, there is a lack of clear understanding of how the momentum parameters affect convergence and various performance measures of different algorithms. In this paper, we use the general formulation of QHM to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and the properties of their stationary distributions. In addition, by combining the results on convergence rates and stationary distributions, we obtain sometimes counter-intuitive practical guidelines for setting the learning rate and momentum parameters. We show that it is important to consider these different aspects together to understand the key trade-offs in tuning the learning rate and momentum parameters for better performance in practice.
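Since the analysis is phrased entirely in terms of QHM's three parameters, it may help to see the update rule itself. Below is a minimal sketch of one QHM step in the two-parameter form of Ma and Yarats, with learning rate α (`alpha`), momentum β (`beta`), and gradient-averaging weight ν (`nu`); the diagonal quadratic test problem and all parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def qhm_step(x, g_buf, grad, alpha, beta, nu):
    """One QHM update.

    g_buf is an exponential moving average of past gradients.
    Special cases recovered by nu:
      nu = 0    -> plain SGD,
      nu = 1    -> (normalized) heavy-ball momentum,
      nu = beta -> Nesterov's accelerated gradient.
    """
    g_buf = beta * g_buf + (1.0 - beta) * grad
    x = x - alpha * (nu * g_buf + (1.0 - nu) * grad)
    return x, g_buf

# Illustrative run on f(x) = 0.5 * x^T A x with a hypothetical diagonal A
# (condition number 10); parameter values are chosen inside the stable region.
A = np.diag([1.0, 10.0])
x = np.array([1.0, 1.0])
g_buf = np.zeros(2)
for _ in range(200):
    grad = A @ x
    x, g_buf = qhm_step(x, g_buf, grad, alpha=0.05, beta=0.9, nu=0.7)
```

On this deterministic quadratic the iterates contract toward the optimum at a linear rate; the stability and rate analysis in the paper characterizes exactly which (α, β, ν) make this contraction possible and how fast it is.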

Figure 5: Plots (a), (b) show the optimal α, β and the convergence rate as functions of ν. Both the rate and the optimal β are decreasing functions of ν. Plots (c), (d) show the dependence of the optimal α, ν and the rate on β; there are three phases in which this dependence is quite different. Also note that in all presented cases, changing ν required changing α in the same way (they increase and decrease together). (Stability region and local convergence rate)

Figure 14: Changes in the shape and size of the stationary distribution with respect to α, β, and ν on a 2-dimensional quadratic problem. Each picture (Figures 6–13, with mean losses 0.11, 0.01, 0.06, 0.15 in the top row and 0.86, 0.06, 0.44, 0.68 in the bottom row) shows the last 5000 iterates of QHM on a contour plot. The first picture of each row is a reference; the other pictures should be compared to it. The second pictures show how the stationary distribution changes when we decrease α; the third and fourth show the dependence on β and ν, respectively. As expected, moving α → 0 or β → 1 always decreases the achievable loss. However, the dependence on ν is more complicated: for some values of α and β, increasing ν increases the loss (top row), while for other values the dependence is reversed (bottom row). Note the scale change between the top and bottom plots. (Stationary analysis)

Figure 15: Dependence of the average final loss (depicted with color: whiter is smaller) on the parameters of the QHM algorithm for different problems. The top row shows results for a synthetic 2-dimensional quadratic problem, where all the assumptions of Theorem ?? are satisfied; the red curve indicates the boundary of the convergence region (the algorithm diverges below it). In this case, we start the algorithm directly at the optimum to measure the size of the stationary distribution while ignoring the convergence rate. As predicted by the theory, smaller α and bigger β make the final loss smaller. The bottom row shows the same experiments repeated for logistic regression on MNIST and ResNet-18 on the CIFAR-10 dataset. While the assumptions of Theorem ?? are no longer valid, QHM still shows similar qualitative behavior. (Stationary analysis)

Figures 16–19: (a) A trade-off between the stationary distribution size (final loss) and the convergence rate on a simple 2-dimensional quadratic problem: algorithms that converge faster typically converge to a higher final loss. (b) The regime where there is no trade-off between stationary distribution size and convergence rate: larger values of α do not change the convergence rate while making the final loss significantly higher. To make plots (a) and (b) smoother, the y-axis shows the loss averaged over each 100 iterations. (c) The same behavior can also be observed in training deep neural networks. For all plots, β = 0.9 and ν = 1.0. All of the presented results depend continuously on the algorithm's parameters (e.g., the transition between the behaviors shown in (a) and (b) is smooth). (Some practical implications and guidelines)

Figure 20: Dependence of the optimal (over α, β) local convergence rate on ν for the QHM algorithm across different values of the condition number κ. Note that R∗(ν, κ) always shows a monotone non-decreasing dependence on ν. (Numerical Evaluation of the Convergence Rate)

Figure 21: Approximation error of equation (??). The relative error is shown with color and thresholded at 0.2. (Approximation error of Theorem ??)
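The qualitative claim behind Figures 14 and 15 — that shrinking α shrinks the stationary distribution and hence the final loss — can be checked with a short simulation. The diagonal quadratic, the additive Gaussian gradient noise, and all parameter values below are illustrative assumptions chosen to mimic the figures' setup (start at the optimum, record the last iterates), not the paper's exact experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])  # hypothetical 2-D quadratic, f(x) = 0.5 * x^T A x

def mean_stationary_loss(alpha, beta=0.9, nu=0.7, iters=5000, noise=1.0):
    """Run QHM with noisy gradients and return the mean loss at stationarity."""
    x = np.zeros(2)        # start at the optimum, as in Figure 15's setup
    g_buf = np.zeros(2)
    losses = []
    for _ in range(iters):
        grad = A @ x + noise * rng.standard_normal(2)  # stochastic gradient
        g_buf = beta * g_buf + (1.0 - beta) * grad
        x = x - alpha * (nu * g_buf + (1.0 - nu) * grad)
        losses.append(0.5 * x @ A @ x)
    # Average over the second half of the run, once the iterates have mixed.
    return float(np.mean(losses[iters // 2:]))

loss_big = mean_stationary_loss(alpha=0.05)
loss_small = mean_stationary_loss(alpha=0.005)
```

With these settings, `loss_small` comes out well below `loss_big`, matching the prediction that α → 0 shrinks the stationary distribution; sweeping ν instead of α reproduces the more delicate, non-monotone dependence described in the Figure 14 caption.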