[1703.04782] Online Learning Rate Adaptation with Hypergradient Descent
Having rediscovered a general method for adapting hyperparameters of gradient-based optimization procedures, we have applied it to the online tuning of the learning rate, producing hypergradient descent variants of SGD, SGD with Nesterov momentum, and Adam that empirically appear to significantly reduce the time and resources needed to tune the initial learning rate.

Abstract: We introduce a general method for improving the convergence rate of
gradient-based optimizers that is easy to implement and works well in practice.
We demonstrate the effectiveness of the method in a range of optimization
problems by applying it to stochastic gradient descent, stochastic gradient
descent with Nesterov momentum, and Adam, showing that it significantly reduces
the need for the manual tuning of the initial learning rate for these commonly
used algorithms. Our method works by dynamically updating the learning rate
during optimization using the gradient with respect to the learning rate of the
update rule itself. Computing this "hypergradient" needs little additional
computation, requires only one extra copy of the original gradient to be stored
in memory, and relies upon nothing more than what is provided by reverse-mode
automatic differentiation.
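The update described above can be made concrete. For plain SGD, θ_t = θ_{t−1} − α ∇f(θ_{t−1}), so the gradient of the objective with respect to α is the "hypergradient" ∂f(θ_t)/∂α = −∇f(θ_t)·∇f(θ_{t−1}), which needs only the previous gradient to be kept in memory. The sketch below illustrates this SGD-HD rule on a toy quadratic objective; the objective and the hyperparameter values (α₀, β, step count) are illustrative choices, not taken from the paper.

```python
import numpy as np

def grad(theta):
    # Gradient of the toy objective f(theta) = 0.5 * ||theta||^2.
    return theta

def sgd_hd(theta0, alpha0=0.01, beta=0.001, steps=100):
    """Sketch of SGD with hypergradient descent (SGD-HD) on the learning rate.

    The hypergradient of the loss w.r.t. alpha is -grad(theta_t) . grad(theta_{t-1}),
    so gradient descent on alpha *adds* beta times the dot product of the
    current and previous gradients.
    """
    theta = np.asarray(theta0, dtype=float)
    alpha = alpha0
    g_prev = grad(theta)             # gradient at the previous iterate
    theta = theta - alpha * g_prev   # one plain SGD step to bootstrap
    for _ in range(steps):
        g = grad(theta)
        # Hypergradient update: increase alpha when successive gradients
        # are aligned, decrease it when they point in opposing directions.
        alpha = alpha + beta * float(np.dot(g, g_prev))
        theta = theta - alpha * g
        g_prev = g
    return theta, alpha

theta, alpha = sgd_hd(np.ones(5))
```

On this convex quadratic, successive gradients stay aligned, so the learning rate grows from its (deliberately too small) initial value and the iterate converges toward the minimum at the origin, matching the paper's claim that the initial learning rate matters less when it is adapted online.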

[Figure 7: Regular and hypergradient algorithms. Panels: SGD, SGD-HD, SGDN, SGDN-HD, Adam, Adam-HD. Left-hand side: SGD with Nesterov momentum (SGDN) and Adam are obtained by substituting the corresponding initialization (red) and update (blue) statements into regular SGD. Right-hand side: the hypergradient variants SGDN-HD and Adam-HD are obtained by substituting the corresponding statements into hypergradient SGD (SGD-HD).]

[Figure 8: Online tuning of the learning rate for logistic regression and a multi-layer neural network. Top row: learning rate; middle row: training loss; bottom row: validation loss. Dashed curves show the regular algorithms SGD and Adam; solid curves show their HD variants, SGD-HD and Adam-HD. HDM denotes an example of the multiplicative update rule.]

[Figure 9: Behavior of hypergradient variants compared with their regular counterparts. Columns: logistic regression on MNIST (left); multi-layer neural network on MNIST (middle); VGG Net on CIFAR-10 (right). Rows: evolution of the learning rate α_t (top); training loss (middle); validation loss (bottom). Main plots show epoch averages; inset plots highlight the behavior during the initial iterations. One epoch is one full pass through the training set: 60,000 images for MNIST (468.75 iterations at minibatch size 128) and 50,000 images for CIFAR-10 (390.625 iterations at minibatch size 128).]

[Figure 10: Grid search over α_0 and β, measuring iterations to reach a training loss of 0.29 for logistic regression. Everywhere to the left of and below the shaded region marked by the red boundary, the hypergradient variants (bottom) perform at least as well as the baseline variants (top). In the limit β → 0 the original update rule is recovered, so in the worst case the hypergradient variants match the baselines.]