[1911.08265v1] Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
Our method does not require any knowledge of the game rules or environment dynamics, potentially paving the way towards the application of powerful learning and planning methods to a host of real-world domains for which there exists no perfect simulator.
Abstract: Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games (the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled), our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
Figure 1: Planning, acting, and training with a learned model. (A) How MuZero uses its model to plan. The model consists of three connected components for representation, dynamics and prediction. Given a previous hidden state s^{k-1} and a candidate action a^k, the dynamics function g produces an immediate reward r^k and a new hidden state s^k. The policy p^k and value function v^k are computed from the hidden state s^k by a prediction function f. The initial hidden state s^0 is obtained by passing the past observations (e.g. the Go board or Atari screen) into a representation function h. (B) How MuZero acts in the environment. A Monte-Carlo Tree Search is performed at each timestep t, as described in A. An action a_{t+1} is sampled from the search policy π_t, which is proportional to the visit count for each action from the root node. The environment receives the action and generates a new observation o_{t+1} and reward u_{t+1}. At the end of the episode the trajectory data is stored into a replay buffer. (C) How MuZero trains its model. A trajectory is sampled from the replay buffer. For the initial step, the representation function h receives as input the past observations o_1, ..., o_t from the selected trajectory. The model is subsequently unrolled recurrently for K steps. At each step k, the dynamics function g receives as input the hidden state s^{k-1} from the previous step and the real action a_{t+k}. The parameters of the representation, dynamics and prediction functions are jointly trained, end-to-end by backpropagation-through-time, to predict three quantities: the policy p^k ≈ π_{t+k}, value function v^k ≈ z_{t+k}, and reward r^k ≈ u_{t+k}, where z_{t+k} is a sample return: either the final reward (board games) or n-step return (Atari).
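To make the three learned functions in Figure 1 concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the representation, dynamics and prediction networks are abstracted as placeholder callables, and only the inference interface used during planning (encoding the observations at the root, then one hypothetical step per search edge) is shown, together with visit-count-proportional action sampling for acting.

```python
# Minimal sketch of the model interface described in Figure 1 (not the paper's code).
# The network bodies h, g, f are placeholder callables supplied by the user.
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

import numpy as np


@dataclass
class MuZeroModel:
    representation: Callable[[np.ndarray], np.ndarray]               # h: o_1..o_t -> s^0
    dynamics: Callable[[np.ndarray, int], Tuple[np.ndarray, float]]  # g: (s^{k-1}, a^k) -> (s^k, r^k)
    prediction: Callable[[np.ndarray], Tuple[np.ndarray, float]]     # f: s^k -> (p^k, v^k)

    def initial_inference(self, observations: np.ndarray):
        """Root of the search: encode past observations into the hidden state s^0."""
        state = self.representation(observations)
        policy, value = self.prediction(state)
        return state, policy, value

    def recurrent_inference(self, state: np.ndarray, action: int):
        """One hypothetical step inside the search tree (Figure 1A)."""
        next_state, reward = self.dynamics(state, action)
        policy, value = self.prediction(next_state)
        return next_state, reward, policy, value


def sample_from_visit_counts(visit_counts: Sequence[int],
                             rng: np.random.Generator) -> int:
    """Acting (Figure 1B): sample an action in proportion to root visit counts."""
    probabilities = np.asarray(visit_counts, dtype=np.float64)
    probabilities /= probabilities.sum()
    return int(rng.choice(len(probabilities), p=probabilities))
```

In the paper the visit-count distribution is additionally shaped by a temperature parameter that is decayed over the course of training; that detail is omitted here for brevity.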
Figure 2: Evaluation of MuZero throughout training in chess, shogi, Go and Atari. The x-axis shows millions of training steps. For chess, shogi and Go, the y-axis shows Elo rating, established by playing games against AlphaZero using 800 simulations per move for both players. MuZero's Elo is indicated by the blue line, AlphaZero's Elo by the horizontal orange line. For Atari, mean (solid line) and median (dashed line) human-normalized scores across all 57 games are shown on the y-axis. The scores for R2D2 (the previous state of the art in this domain, based on model-free RL) are indicated by the horizontal orange lines. Performance in Atari was evaluated using 50 simulations every fourth time-step, and then repeating the chosen action four times, as in prior work.

Figure 3: Evaluations of MuZero on Go (A), all 57 Atari games (B) and Ms. Pacman (C-D). (A) Scaling with search time per move in Go, comparing the learned model with the ground-truth simulator. Both networks were trained at 800 simulations per search, equivalent to 0.1 seconds per search. Remarkably, the learned model is able to scale well to searches up to two orders of magnitude longer than seen during training. (B) Scaling of final human-normalized mean score in Atari with the number of simulations per search. The network was trained at 50 simulations per search. Dark line indicates mean score, shaded regions indicate 25th to 75th and 5th to 95th percentiles. The learned model's performance increases up to 100 simulations per search; beyond that, even when scaling to much longer searches than during training, performance remains stable and only decreases slightly. This contrasts with the much better scaling in Go (A), presumably due to greater model inaccuracy in Atari than in Go. (C) Comparison of MCTS-based training with Q-learning in the MuZero framework on Ms. Pacman, keeping network size and amount of training constant. The state-of-the-art Q-learning algorithm R2D2 is shown as a baseline. Our Q-learning implementation reaches the same final score as R2D2, but improves more slowly and results in much lower final performance than MCTS-based training. (D) Different networks trained at different numbers of simulations per move, but all evaluated at 50 simulations per move. Networks trained with more simulations per move improve faster, consistent with ablation (B), where the policy improvement is larger when using more simulations per move. Surprisingly, MuZero can learn effectively even when training with fewer simulations per move than are enough to cover all 8 possible actions in Ms. Pacman.

Figure 4: Repeatability of MuZero in Atari for five games. Total reward is shown on the y-axis, millions of training steps on the x-axis. Dark line indicates median score across 10 separate training runs, light lines indicate individual training runs, and the shaded region indicates the 25th to 75th percentile.

Figure 5: Equations summarising the MuZero algorithm. Here, φ(x) refers to the representation of a real number x through a linear combination of its adjacent integers, as described in the Network Architecture section.

Figure 7: Learning curves of MuZero in Atari for individual games. Total reward is shown on the y-axis, millions of training steps on the x-axis. Line indicates mean score across 1000 evaluation games, shaded region indicates standard deviation.
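The φ(x) transform mentioned in the Figure 5 caption can be read as a two-hot encoding over an integer support: the weight placed on the two adjacent integers is chosen so that their weighted average recovers x. The sketch below illustrates that reading; the support bounds are placeholder assumptions, not the paper's settings.

```python
# Illustrative sketch of the adjacent-integer encoding phi(x) from Figure 5.
# Support bounds are placeholder assumptions.
import numpy as np


def phi(x: float, support_min: int = -300, support_max: int = 300) -> np.ndarray:
    """Encode a real number as weights on its two adjacent integers.

    Example: phi(3.7) places weight 0.3 on the bin for 3 and 0.7 on the bin for 4,
    so the expected value of the encoding equals 3.7.
    """
    x = float(np.clip(x, support_min, support_max))
    lower = int(np.floor(x))
    upper_weight = x - lower
    encoding = np.zeros(support_max - support_min + 1)
    encoding[lower - support_min] = 1.0 - upper_weight
    if upper_weight > 0.0:
        encoding[lower - support_min + 1] = upper_weight
    return encoding


def phi_inverse(encoding: np.ndarray, support_min: int = -300) -> float:
    """Recover the scalar as the expectation over the integer support."""
    support = np.arange(support_min, support_min + len(encoding))
    return float(np.dot(encoding, support))
```

The paper additionally applies an invertible scaling transform to value and reward targets before this encoding; that step is omitted from the sketch.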
The model is trained by minimising, for each sampled trajectory position t, a loss summed over the K unrolled steps that combines reward, value and policy prediction errors with an L2 regularisation term:

$$l_t(\theta) = \sum_{k=0}^{K} \Big[\, l^r(u_{t+k}, r_t^k) + l^v(z_{t+k}, v_t^k) + l^p(\pi_{t+k}, p_t^k) \,\Big] + c\,\|\theta\|^2$$
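A minimal sketch of how this loss could be assembled during training, assuming the model interface from the earlier sketch and user-supplied per-term loss functions l_r, l_v, l_p (e.g. cross-entropy against the encoded targets). In this sketch the reward term is only computed for the unrolled steps k ≥ 1, and the regularisation scale c is an illustrative default, not the paper's setting.

```python
# Sketch of the K-step unrolled training loss above (not the authors' code).
# `model` follows the MuZeroModel interface sketched earlier; l_r, l_v, l_p are
# per-step reward, value and policy losses; `parameters` are the weights theta.
import numpy as np


def unrolled_loss(model, observations, actions, target_rewards, target_values,
                  target_policies, l_r, l_v, l_p, parameters, c=1e-4):
    # k = 0: encode the observations and score the value and policy predictions.
    state, policy, value = model.initial_inference(observations)
    loss = l_v(target_values[0], value) + l_p(target_policies[0], policy)

    # k = 1..K: unroll the dynamics with the actions actually taken.
    for k, action in enumerate(actions, start=1):
        state, reward, policy, value = model.recurrent_inference(state, action)
        loss += l_r(target_rewards[k], reward)
        loss += l_v(target_values[k], value)
        loss += l_p(target_policies[k], policy)

    # L2 regularisation c * ||theta||^2 over all parameters.
    loss += c * sum(float(np.sum(np.square(p))) for p in parameters)
    return loss
```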
Within the search tree, actions are selected by maximising an upper confidence bound over the mean value Q(s, a), prior probability P(s, a) and visit counts N(s, ·):

$$a^k = \arg\max_a \left[ Q(s, a) + P(s, a) \cdot \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)} \left( c_1 + \log\!\left( \frac{\sum_b N(s, b) + c_2 + 1}{c_2} \right) \right) \right]$$
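A sketch of this selection rule at a single node, assuming plain per-action arrays of visit counts N, mean values Q and priors P; the defaults c1 = 1.25 and c2 = 19652 are the constants given in the paper's pseudocode.

```python
# Sketch of the pUCT action selection above, for one node of the search tree.
import math


def select_action(N, Q, P, c1: float = 1.25, c2: float = 19652.0) -> int:
    """N, Q, P are per-action visit counts, mean values and prior probabilities."""
    total_visits = sum(N)
    exploration = c1 + math.log((total_visits + c2 + 1) / c2)

    def upper_confidence_bound(a: int) -> float:
        prior_bonus = P[a] * math.sqrt(total_visits) / (1 + N[a]) * exploration
        return Q[a] + prior_bonus

    return max(range(len(N)), key=upper_confidence_bound)
```

In the paper the Q values are additionally normalised using the minimum and maximum values observed in the search tree; that normalisation is omitted here.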
The value backed up through the tree is an (l − k)-step estimate of the cumulative discounted reward, bootstrapping from the value estimate v_l at the leaf of the simulation:

$$G^k = \sum_{\tau=0}^{l-1-k} \gamma^{\tau}\, r_{k+1+\tau} + \gamma^{l-k}\, v_l$$

Figure 6: Details of MuZero evaluations (A-B) and policy improvement ablations (C-D). (A-B) Distribution of evaluation depth in the search tree for the learned model, for the evaluations in Figure 3A-B. The network was trained over 5 hypothetical steps, as indicated by the red line. Dark blue line indicates median depth from the root, dark shaded region shows 25th to 75th percentile, light shaded region shows 5th to 95th percentile. (C) Policy improvement in Ms. Pacman: a single network was trained at 50 simulations per search and is evaluated at different numbers of simulations per search, including playing according to the argmax of the raw policy network. The policy improvement effect of the search over the raw policy network is clearly visible throughout training. This consistent gap between the performance with and without search highlights the policy improvement that MuZero exploits, by continually updating towards the improved policy, to efficiently progress towards the optimal policy. (D) Policy improvement in Go: a single network was trained at 800 simulations per search and is evaluated at different numbers of simulations per search. In Go, the playing strength improvement from longer searches is much larger than in Ms. Pacman and persists throughout training, consistent with previous results. This suggests, as might intuitively be expected, that the benefit of models is greatest in precision planning domains.
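Finally, a sketch of the bootstrapped return G^k defined above: rewards collected along a simulated path are discounted by γ and combined with the discounted value estimate at the leaf. The default discount of 0.997 is the value the paper reports for Atari (board games use a discount of 1) and is only a default here.

```python
# Sketch of the bootstrapped return G^k used to back up the search (equation above).
def bootstrapped_return(rewards, leaf_value, gamma=0.997):
    """rewards[tau] is r_{k+1+tau} for tau = 0 .. l-1-k; leaf_value is v_l."""
    g = 0.0
    for tau, reward in enumerate(rewards):
        g += (gamma ** tau) * reward
    return g + (gamma ** len(rewards)) * leaf_value
```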