[1910.09529v1] Adaptive gradient descent without descent

Abstract We present a strikingly simple proof that two rules are sufficient to automate gradient descent: 1) don't increase the stepsize too fast and 2) don't overstep the local curvature. No need for functional values, no line search, no information about the function except for the gradients. By following these rules, you get a method adaptive to the local geometry, with convergence guarantees depending only on smoothness in a neighborhood of a solution. Provided that the problem is convex, our method converges even if the global smoothness constant is infinite. As an illustration, it can minimize any twice continuously differentiable convex function. We examine its performance on a range of convex and nonconvex problems, including matrix factorization and training of ResNet-18.
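The two rules above translate directly into a stepsize update that needs only the current and previous iterates and gradients. The sketch below follows the commonly cited AdGD update associated with this paper; the exact constants (the sqrt(1 + θ) growth cap and the factor 2 in the curvature estimate) are an assumption here, not stated in this abstract, and the fallback when consecutive gradients coincide is our own guard.

```python
import numpy as np

def adgd(grad, x0, lam0=1e-6, iters=1000):
    """Adaptive gradient descent (sketch). The stepsize obeys two rules:
    1) don't grow faster than a sqrt(1 + theta) factor per step, and
    2) don't overstep the local curvature, estimated from gradient
       differences as ||x_k - x_{k-1}|| / (2 ||g_k - g_{k-1}||).
    No function values and no line search are used, only gradients."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    lam_prev, theta = lam0, np.inf   # theta = lam_k / lam_{k-1}
    x = x_prev - lam_prev * g_prev   # one plain step to get a second point
    for _ in range(iters):
        g = grad(x)
        dg = np.linalg.norm(g - g_prev)
        # Rule 2: inverse of the locally estimated Lipschitz constant.
        # Fallback to the previous stepsize if gradients coincide (guard,
        # not part of the paper's rule).
        local = np.linalg.norm(x - x_prev) / (2 * dg) if dg > 0 else lam_prev
        # Rule 1: cap the growth of the stepsize.
        lam = min(np.sqrt(1 + theta) * lam_prev, local)
        x_prev, g_prev = x, g
        x = x - lam * g
        theta, lam_prev = lam / lam_prev, lam
    return x

# Usage: a badly conditioned quadratic, f(x) = 0.5 x^T A x, minimum at 0.
A = np.diag([1.0, 100.0])
sol = adgd(lambda x: A @ x, np.array([1.0, 1.0]))
```

Note that the stepsize is allowed to increase between iterations, so the method can exploit regions of low curvature instead of being pinned to a worst-case global 1/L.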

[Figures omitted; captions retained below.]

Figure 1: Scenario 1. Figure 2: Scenario 2. Figure 3: Scenario 3. Figure 4: Results for the quadratic problem (decreasing stepsizes in SGD).

Figure 5: Mushrooms dataset, objective. Figure 6: Mushrooms dataset, stepsize. Figure 7: Covtype dataset, objective. Figure 8: Results for the logistic regression problem; note that 'AdGD' and 'AdGD-L' are almost indistinguishable, as the global value of L is quite large compared to the local value, which can be seen from the middle plot (decreasing stepsizes in SGD).

Figure 9: r = 10. Figure 10: r = 20. Figure 11: r = 30. Figure 12: Results for matrix factorization; the objective is neither convex nor smooth.

Figure 13: Test accuracy. Figure 14: Estimated λk. Figure 15: Train loss. Figure 16: Results for training ResNet-18 on Cifar10; labels for AdGD correspond to how λk was estimated.

Figure 17: M = 10. Figure 18: M = 20. Figure 19: M = 100. Figure 20: Results for the non-smooth subproblem from cubic regularization.