[1912.02807] Combining Q-Learning and Search with Amortized Value Estimates
More broadly, we suggest that SAVE can be interpreted as a framework for ensuring that the valuable computation performed during search is preserved, rather than being used only for the immediate action or summarized indirectly via the frequency statistics of the search policy.

We introduce “Search with Amortized Value Estimates” (SAVE), an approach for combining model-free Q-learning with model-based Monte-Carlo Tree Search (MCTS). In SAVE, a learned prior over state-action values is used to guide MCTS, which estimates an improved set of state-action values. The new Q-estimates are then used in combination with real experience to update the prior. This effectively amortizes the value computation performed by MCTS, resulting in a cooperative relationship between model-free learning and model-based search. SAVE can be implemented on top of any Q-learning agent with access to a model, which we demonstrate by incorporating it into agents that perform challenging physical reasoning tasks and play Atari games. SAVE consistently achieves higher rewards with fewer training steps, and—in contrast to typical model-based search approaches—yields strong performance with very small search budgets. By combining real experience with information computed during search, SAVE demonstrates that it is possible to improve on both the performance of model-free learning and the computational cost of planning.
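The acting side of this loop can be sketched in a few lines. Note this is an illustrative sketch only: the running-average update seeded by the prior, the UCB1-style selection rule, and the `simulate_return` rollout function are assumptions made here for clarity, not the paper's exact formulation (SAVE folds Qθ into the UCT value estimates during search):

```python
import numpy as np

def search_q_estimates(q_prior, simulate_return, num_simulations):
    """Refine prior Q-values with model-based search (illustrative sketch).

    q_prior:         Q_theta(s, .) for the current state, shape (num_actions,)
    simulate_return: hypothetical rollout function, action -> sampled return
    The prior is treated as one 'virtual' visit per action, so with a small
    search budget the estimates stay close to Q_theta, echoing Q_0 = Q_theta.
    """
    n = np.ones(len(q_prior))                    # one virtual visit from the prior
    q = np.asarray(q_prior, dtype=float).copy()  # Q_0
    for _ in range(num_simulations):
        # UCB1-style action selection over the current estimates
        ucb = q + np.sqrt(2.0 * np.log(n.sum()) / n)
        a = int(np.argmax(ucb))
        g = simulate_return(a)
        n[a] += 1
        q[a] += (g - q[a]) / n[a]                # running-mean update, Q_k -> Q_{k+1}
    return q                                     # returned as Q_MCTS

def epsilon_greedy(q_mcts, epsilon, rng):
    """Select the acting action from the search-derived Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_mcts)))
    return int(np.argmax(q_mcts))
```

With `num_simulations=0` the agent simply acts greedily with respect to Qθ, which is consistent with why SAVE degrades gracefully at very small search budgets.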
Figure 2: Illustration of SAVE. When acting, the agent uses a Q-function, Qθ, as a prior for the Q-values estimated during MCTS. Over K steps of search, Q0 ≡ Qθ is built up to QK, which is returned as QMCTS. From QMCTS, an action a is selected via epsilon-greedy exploration, and the resulting experience (s, a, r, s′, QMCTS) is added to a replay buffer. When learning, the agent uses real experience to update Qθ via a Q-learning loss (LQ) as well as an amortization loss (LA), which regresses Qθ towards the Q-values estimated during search. (Method)

Figure 3: Results on Tightrope. (a-c) Tabular results comparing SAVE, PUCT, UCT, and Q-learning (with MCTS at test time) for varying percentages of terminal actions on the x-axes and for different search budgets. The y-axes show reward for either the sparse or dense reward setting of Tightrope. Lines show medians across 20 seeds, with error bars showing 95% confidence intervals. (d) Results on Tightrope when using function approximation, comparing SAVE with PUCT and Q-learning. Lines show medians across 10 seeds, with shaded regions indicating min and max seeds. (Tightrope)
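The learning update described in the Figure 2 caption can be written as a single combined objective over a stored transition. This is a minimal sketch: the β weighting is an assumed hyperparameter, and the L2 form of LA used here corresponds to the ablation variant in Figure 4d rather than necessarily SAVE's default amortization loss:

```python
import numpy as np

def save_loss(q_s, q_next, a, r, q_mcts, gamma=0.99, beta=0.5):
    """Combined SAVE objective on one transition (s, a, r, s', Q_MCTS).

    q_s:    Q_theta(s, .)  for the stored state
    q_next: Q_theta(s', .) for the successor state
    q_mcts: search-derived Q-estimates stored with the transition
    beta:   assumed weighting between the two losses (not from the paper)
    """
    # L_Q: standard one-step Q-learning loss on real experience
    td_target = r + gamma * np.max(q_next)
    l_q = (td_target - q_s[a]) ** 2
    # L_A: amortization loss regressing Q_theta(s, .) toward Q_MCTS,
    # written here in its L2 form (the variant ablated in Figure 4d)
    l_a = np.mean((q_mcts - q_s) ** 2)
    return l_q + beta * l_a
```

Because LA is computed from the QMCTS vector stored in the replay buffer, the value computation performed during search continues to shape Qθ long after the action that triggered it.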

Figure 4: Results on Construction. (a-c) Each subplot shows results for SAVE, SAVE without amortization loss, Q-learning with MCTS at test time, and pure search (UCT). The x-axis shows the effect of increasing the number of MCTS simulations at test time. During training, SAVE with and without amortization loss used a search budget of 10 simulations; UCT used a search budget of 1000 simulations. Points show medians across 10 seeds, and error bars indicate min and max seeds. (d) Ablation experiments on the Covering task, comparing SAVE to variants that have no amortization loss, that use an L2 amortization loss, that do not use the Q-learning loss, and that use PUCT rather than UCT. Results are shown at the hardest level of difficulty for the Covering task with a test budget of 10. Colored bars show median reward across 10 seeds, and error bars show min and max seeds. (Construction)

Figure 5: (a-b) Results on the Marble Run environment for model-free Q-learning as well as SAVE as a function of curriculum difficulty level, for two different settings of the cost of “sticky” blocks. Points indicate medians across 10 seeds, and error bars show min and max seeds. (c-d) Structures built by SAVE which solve the same scene for two different costs of sticky blocks (difficulty 6). Additional videos showing agent behavior are available at https://tinyurl.com/yxm4ma47. (Marble Run)

Figure 6: Results on Atari. (Discussion)

Figure 7: Learning curves on the Covering task. Each plot shows median performance across 10 seeds, with shaded regions showing the min and max seeds. (Details on Construction)

Figure 8: Detailed final results on the Covering task. Each plot shows median performance across 10 seeds, with error bars showing the min and max seeds. (Details on Construction)

Figure 9: Performance of different exploration strategies on the Covering task. (Additional results)

Figure 10: Performance of SAVE and Q-learning on Covering, controlling for the same number of environment interactions (including those seen during search). (Additional results)

Figure 11: Scene samples at each curriculum level for the Marble Run task. During training, the n-th level of the curriculum consists of scenes sampled from the rows up to the n-th row, with a truncated geometric distribution with a decay of 0.5. (Adaptive curriculum)

Figure 12: Learning curves for the Marble Run environment. Each line shows the median reward across 10 seeds, and the shaded regions show min and max seed performance. Each color corresponds to a different level of curriculum difficulty. Difficulties less than the final difficulty are only evaluated while the agent is training at that curriculum level; the final level of difficulty is always evaluated. (Additional results)

Figure 13: Curriculum progress in Marble Run. Light lines show individual curriculum progress per seed, and dark lines are computed over the median of these seeds. The x-axis shows the particular curriculum level, and the y-axis indicates at which episode that level of difficulty was reached. (Additional results)

Figure 14: Learning curves on Atari games. Solid lines show the average over 3 seeds, and shaded regions show min and max seeds. (Details on Atari)