[1910.06403] BoTorch: Programmable Bayesian Optimization in PyTorch
For example, if one can condition on gradient observations of the objective, then it may be possible to apply Bayesian optimization where traditional gradient-based optimizers are used, but with faster convergence and robustness to multimodality.
Bayesian optimization provides sample-efficient global optimization for a broad range of applications, including automatic machine learning, molecular chemistry, and experimental design. We introduce BoTorch, a modern programming framework for Bayesian optimization. Enabled by Monte-Carlo (MC) acquisition functions and auto-differentiation, BoTorch's modular design facilitates flexible specification and optimization of probabilistic models written in PyTorch, radically simplifying implementation of novel acquisition functions. Our MC approach is made practical by a distinctive algorithmic foundation that leverages fast predictive distributions and hardware acceleration. In experiments, we demonstrate the improved sample efficiency of BoTorch relative to other popular libraries. BoTorch is open source and available at https://github.com/pytorch/botorch.
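The modular workflow the abstract describes (a probabilistic model written in PyTorch/GPyTorch, an MC acquisition function, and gradient-based optimization of that acquisition function via auto-differentiation) can be sketched roughly as follows. This is a minimal sketch assuming a recent BoTorch release; the toy objective, batch size, and optimizer settings are illustrative choices, not the paper's benchmarks.

    import torch
    from botorch.models import SingleTaskGP
    from botorch.fit import fit_gpytorch_mll
    from botorch.acquisition import qExpectedImprovement
    from botorch.optim import optimize_acqf
    from gpytorch.mlls import ExactMarginalLogLikelihood

    # Illustrative toy objective to maximize (not one of the paper's benchmarks).
    def objective(X):
        return -(X - 0.5).pow(2).sum(dim=-1, keepdim=True)

    # Initial design and observations.
    train_X = torch.rand(10, 6, dtype=torch.double)
    train_Y = objective(train_X)

    # GP surrogate written in PyTorch/GPyTorch, fit by maximizing the marginal likelihood.
    gp = SingleTaskGP(train_X, train_Y)
    mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
    fit_gpytorch_mll(mll)

    # Monte-Carlo acquisition function (qEI) for a batch of q = 4 candidates.
    qei = qExpectedImprovement(model=gp, best_f=train_Y.max())

    # Optimize the acquisition function; gradients come from auto-differentiation.
    bounds = torch.stack([torch.zeros(6), torch.ones(6)]).to(torch.double)
    candidates, acq_value = optimize_acqf(
        qei, bounds=bounds, q=4, num_restarts=10, raw_samples=128
    )

Because the acquisition value is an average of differentiable operations on posterior samples, the candidate batch can be optimized with standard gradient-based methods, which is the point the abstract makes about MC acquisition functions plus auto-differentiation.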
Figure captions (section names in parentheses):

Figure 1. High-level BoTorch primitives. (BoTorch Architecture)
Figure 2. Monte Carlo acquisition functions in BoTorch. Samples ζ_i from the posterior P provided by the model M at x ∪ x̃ are evaluated in parallel and averaged to form the MC estimate of the acquisition value. All operations are fully differentiable. (Acquisition Functions)
Figure 3. Composite function (CF) optimization for q = 1, showing log regret evaluated at the maximizer of the posterior mean, averaged over 250 trials. The CF variant of BoTorch's knowledge gradient algorithm, OKG-CF, outperforms EI-CF from Astudillo & Frazier (2019). (Composite Objectives)
Figure 4. Locations for 2017 samples produced by NIPV. The samples cluster in higher-variance areas. (Active Learning)
Figure 5. Wall times for batched evaluation of qEI. Figure 6. Speedups from using fast predictive distributions. (Exploiting Parallelism and Hardware Acceleration)
Figure 7. Hartmann (d = 6), noisy, best suggested point. Figure 8. KG wall times. (Computational Scaling)
Figure 9. DQN tuning benchmark (Cartpole). Figure 10. NN surrogate model, best observed accuracy. (DQN and Cartpole)
Figure 11. MC and qMC acquisition functions, with and without re-drawing the base samples between evaluations. The model is a GP fit on 15 points randomly sampled from X = [0, 1]^6 and evaluated on the (negative) Hartmann6 test function. The acquisition functions are evaluated along the slice x(λ) = λ·1. (Optimization with Fixed Samples)
Figure 12. Performance for optimizing qMC-based EI. Solid lines: fixed base samples, optimized via L-BFGS-B. Dashed lines: re-sampled base samples, optimized via Adam. (Optimization with Fixed Samples)
Figure 13. Stochastic vs. deterministic optimization of EI on Hartmann6. Figure 14. Branin (d = 2). (Synthetic Functions)
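The estimator that the Figure 2 caption alludes to can be sketched in the caption's own notation (the exact numbered equation is in the paper; the form below is the standard sample-average approximation): the acquisition value at x, given pending points x̃, averages a sample-level utility a(·) over N posterior samples,

\[
  \alpha(x) \;\approx\; \frac{1}{N} \sum_{i=1}^{N} a(\zeta_i),
  \qquad \zeta_i \sim \mathbb{P}\bigl(f(x \cup \tilde{x}) \mid \mathcal{D}\bigr),
\]

where, for example, qEI uses the utility \(a(\zeta) = \bigl(\max_j \zeta_j - f^{*}\bigr)_{+}\) with \(f^{*}\) the best objective value observed so far. Since each term is differentiable in x, the average is as well, which is what "all operations are fully differentiable" refers to.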