[1910.10897v1] Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning
To summarize, we believe that the proposed task suite represents a significant step towards evaluating multi-task and meta-learning algorithms on diverse robotic manipulation problems, and that it will pave the way for future research in these areas.
Abstract: [∗ denotes equal contribution] Meta-reinforcement learning algorithms can enable robots to acquire new skills much more quickly by leveraging prior experience to learn how to learn. However, much of the current research on meta-reinforcement learning focuses on task distributions that are very narrow. For example, a commonly used meta-reinforcement learning benchmark uses different running velocities for a simulated robot as different tasks. When policies are meta-trained on such narrow task distributions, they cannot possibly generalize to acquire entirely new tasks more quickly. Therefore, if the aim of these methods is to enable faster acquisition of entirely new behaviors, we must evaluate them on task distributions that are sufficiently broad to enable generalization to new behaviors. In this paper, we propose an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of 50 distinct robotic manipulation tasks. Our aim is to make it possible to develop algorithms that generalize to accelerate the acquisition of entirely new, held-out tasks. We evaluate 6 state-of-the-art meta-reinforcement learning and multi-task learning algorithms on these tasks. Surprisingly, while each task and its variations (e.g., with different object positions) can be learned with reasonable success, these algorithms struggle to learn multiple tasks at the same time, even with as few as ten distinct training tasks. Our analysis and open-source environments pave the way for future research in multi-task learning and meta-learning that can enable meaningful generalization, thereby unlocking the full potential of these methods. [Videos of the benchmark tasks are on the project page: meta-world.github.io. Our open-source code is available at https://github.com/rlworkgroup/metaworld].
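As a rough illustration of how the released environments can be exercised, below is a minimal sketch of constructing the simplest evaluation, ML1, following the usage pattern documented in the rlworkgroup/metaworld README. The exact class names, task identifiers (e.g., 'pick-place-v1'), and the step return signature vary across versions of the package, so this should be read as an assumption-laden sketch rather than the definitive API.

```python
# Minimal sketch: instantiate the ML1 benchmark and step one environment.
# Assumes the interface documented in the rlworkgroup/metaworld README;
# names such as ML1 and 'pick-place-v1' may differ in other releases.
import random

import metaworld

ml1 = metaworld.ML1('pick-place-v1')          # benchmark with a single task family
env = ml1.train_classes['pick-place-v1']()    # instantiate the environment class
env.set_task(random.choice(ml1.train_tasks))  # sample a parametric variation (object/goal positions)

obs = env.reset()
action = env.action_space.sample()            # random end-effector action
obs, reward, done, info = env.step(action)    # Gym-style 4-tuple; newer versions may return 5 values
```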
Figure 1: Meta-World contains 50 manipulation tasks, designed to be diverse yet carry shared structure that can be leveraged for efficient multi-task RL and transfer to new tasks via meta-RL. In the most difficult evaluation, the method must use experience from 45 training tasks (left) to quickly learn distinctly new test tasks (right). (Introduction)
Figure 4: Visualization of three of our multi-task and meta-learning evaluation protocols, ranging from within-task adaptation in ML1, to multi-task training across 10 distinct task families in MT10, to adapting to new tasks in ML10. Our most challenging evaluation mode, ML45, is shown in Figure ??. (Evaluation Protocol)
Figure 5: Comparison on our simplest meta-RL evaluation, ML1. (Experimental Results and Analysis)
Figure 7: Performance of independent policies trained on individual tasks using soft actor-critic (SAC) and proximal policy optimization (PPO). We verify that SAC can solve all of the tasks and that PPO can solve most of them. (Benchmark Verification with Single-Task Learning)
Figure 2: Parametric/non-parametric variation: all “reach puck” tasks (left) can be parameterized by the puck position, while the difference between “reach puck” and “open window” (right) is non-parametric. (The Space of Manipulation Tasks: Parametric and Non-Parametric Variability)
Figure 6: Full quantitative results on MT10, MT50, ML10, and ML45. Note that, even on the challenging ML10 and ML45 benchmarks, current methods already exhibit some degree of generalization, but meta-training performance leaves considerable room for improvement, suggesting that future work could attain better performance on these benchmarks. We also report the average success rates for all benchmarks in Table ??. (Experimental Results and Analysis)
Figure 8: Comparison of PEARL, MAML, and RL2 learning curves on the simplest evaluation, ML1, where the methods must adapt quickly to new object and goal positions within a single meta-training task. (Learning curves)
Figure 9: Learning curves of all methods on the MT10, ML10, MT50, and ML45 benchmarks. The y-axis shows the success rate averaged over tasks (%). The dashed lines represent asymptotic performance. Off-policy algorithms such as multi-task SAC and PEARL learn much more efficiently than on-policy methods, though PEARL underperforms MAML and RL2. (Learning curves)
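The protocols named in these captions (MT10, ML10, MT50, ML45) correspond to benchmark constructors in the released package. The following hedged sketch shows how the multi-task training split and the meta-learning train/test split are exposed, again following the README-documented usage; attribute names such as train_classes and test_classes are assumptions tied to that interface and may change between versions.

```python
# Sketch: build MT10 (multi-task training, no held-out task families) and ML10
# (meta-learning with held-out test task families). Assumes the interface
# documented in the rlworkgroup/metaworld README; attributes may vary by version.
import random

import metaworld

mt10 = metaworld.MT10()
train_envs = []
for name, env_cls in mt10.train_classes.items():
    env = env_cls()
    # each task family comes with many parametric variations (object/goal positions)
    env.set_task(random.choice([t for t in mt10.train_tasks if t.env_name == name]))
    train_envs.append(env)

ml10 = metaworld.ML10()
print(len(ml10.train_classes), "meta-training task families")  # 10 per the paper
print(len(ml10.test_classes), "held-out test task families")   # 5 per the paper
```

The key distinction the code mirrors is the one drawn in the paper: MT10/MT50 share all task families between training and evaluation (only parametric variation changes), while ML10/ML45 hold out entire task families for meta-test adaptation.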