[1909.13371] Gradient Descent: The Ultimate Optimizer
Unlike existing work, our proposed hyperoptimizers learn hyperparameters beyond just learning rates, require no manual differentiation by the user, and can be stacked recursively to many levels.

Abstract: Working with any gradient-based machine learning algorithm involves the
tedious task of tuning the optimizer's hyperparameters, such as the learning
rate. There exist many techniques for automated hyperparameter optimization,
but they typically introduce even more hyperparameters to control the
hyperparameter optimization process. We propose to instead learn the
hyperparameters themselves by gradient descent, and furthermore to learn the
hyper-hyperparameters by gradient descent as well, and so on ad infinitum. As
these towers of gradient-based optimizers grow, they become significantly less
sensitive to the choice of top-level hyperparameters, hence decreasing the
burden on the user to search for optimal values.
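The abstract's core proposal can be sketched for plain SGD on a scalar toy problem: by the chain rule, the derivative of the loss with respect to the step size α is minus the product of the current and previous gradients, so α itself can be stepped by gradient descent. The quadratic loss, constants, and variable names below are illustrative assumptions, not taken from the paper.

```python
# One-level "hyperoptimizer" sketch: SGD whose step size alpha is itself
# updated by gradient descent. The quadratic loss f(w) = w^2 is an assumed
# toy example; kappa is the hyper-step size (the hyper-hyperparameter).

def grad(w):
    return 2.0 * w  # gradient of f(w) = w^2

w, alpha, kappa = 5.0, 0.01, 0.001
g_prev = grad(w)
for _ in range(100):
    w -= alpha * g_prev   # ordinary SGD step: w_t = w_{t-1} - alpha * g_{t-1}
    g = grad(w)
    # Chain rule: df(w_t)/dalpha = g_t * dw_t/dalpha = -g_t * g_{t-1},
    # so gradient *descent* on alpha adds kappa * g_t * g_{t-1}.
    alpha += kappa * g * g_prev
    g_prev = g
```

Starting from a deliberately small α = 0.01, the hyperoptimizer grows α toward a good value on its own, which is exactly the reduced sensitivity to the top-level choice that the abstract describes.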

Figure 1 (Introduction). The “hyperoptimization surface” described in the Introduction. The thin solid traces show vanilla SGD optimizers with a variety of choices of the hyperparameter α. The thick orange trace is the desired behavior, in which the “hyperoptimizer” learns an optimal α over the course of training and thus outperforms the vanilla optimizer that begins at the same α.

Figure (Computing the step-size update rule automatically). (a) Computation graph of SGD with a fixed hyperparameter α. (b) Computation graph of SGD with a continuously-updated hyperparameter α_i. (c) Comparison of the computation graphs of vanilla SGD and HyperSGD.

Figure 3 (Higher-Order Hyperoptimization). As we stack more and more layers of SGD, the resulting hyperoptimizer becomes less sensitive to the initial choice of hyperparameters.

Figure (Performance). (a) Higher-order hyperoptimization performance with SGD. (b) Higher-order hyperoptimization performance with Adam. (c) As the stacks of hyperoptimizers grow taller, each step of SGD takes longer by a small constant factor, corresponding to traversing one node further in the backward AD computation graph.
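The recursive stacking in the captions above can also be sketched in the same scalar toy setting: applying the chain rule one more level shows that each step size's hypergradient is minus the product of the level below's current and previous hypergradients, so a whole tower of SGD optimizers steps with one uniform rule. This generic loop and its constants are illustrative assumptions, not the paper's autodiff-based implementation.

```python
# Sketch of a tower of SGD optimizers on a scalar toy problem (assumed
# setup, not the paper's implementation). alphas[0] steps the weight w,
# alphas[k] steps alphas[k-1]; the topmost alphas[-1] stays fixed.
# Chain rule, one level up: h_k = -h_{k-1} * h_{k-1}^{previous},
# where h_0 is the plain gradient of the loss.

def hyper_sgd_tower(grad, w, alphas, steps):
    K = len(alphas)
    h_prev = [0.0] * K          # previous step's hypergradients
    for _ in range(steps):
        h = [grad(w)]           # level 0: ordinary gradient
        for k in range(1, K):
            h.append(-h[k - 1] * h_prev[k - 1])
        # Update step sizes top-down, so each uses its freshly stepped parent.
        for k in range(K - 2, -1, -1):
            alphas[k] -= alphas[k + 1] * h[k + 1]
        w -= alphas[0] * h[0]   # finally, step the weight
        h_prev = h
    return w, alphas

# Toy quadratic loss f(w) = w^2, three-level tower with a deliberately
# small initial alphas[0]; the tower grows it to a useful value itself.
w_final, alphas_final = hyper_sgd_tower(
    lambda w: 2.0 * w, 5.0, [0.01, 1e-3, 1e-7], 100)
```

Because every level uses the same update rule, adding a level only appends one more multiply per step, consistent with the small constant per-step overhead reported in the Performance caption.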