[1910.07224] Teacher algorithms for curriculum learning of Deep RL in continuously parameterized environments
With no prior knowledge of its student’s abilities and only loose boundaries on the task space, ALP-GMM, our proposed teacher, consistently outperformed random heuristics and occasionally even expert-designed curricula

Abstract: We consider the problem of how a teacher algorithm can enable an unknown Deep
Reinforcement Learning (DRL) student to become good at a skill over a wide
range of diverse environments. To do so, we study how a teacher algorithm can
learn to generate a learning curriculum, whereby it sequentially samples
parameters controlling a stochastic procedural generation of environments.
Because it does not initially know the capacities of its student, a key
challenge for the teacher is to discover which environments are easy, difficult
or unlearnable, and in what order to propose them to maximize the efficiency of
learning over the learnable ones. To achieve this, the problem is transformed
into a surrogate continuous bandit problem where the teacher samples
environments in order to maximize absolute learning progress of its student. We
present a new algorithm modeling absolute learning progress with Gaussian
mixture models (ALP-GMM). We also adapt existing algorithms and provide a
complete study in the context of DRL. Using parameterized variants of the
BipedalWalker environment, we study their efficiency to personalize a learning
curriculum for different learners (embodiments), their robustness to the ratio
of learnable/unlearnable environments, and their scalability to non-linear and
high-dimensional parameter spaces. Videos and code are available at
this https URL.
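The abstract's core loop can be sketched as follows. This is a minimal, simplified illustration of an ALP-GMM-style teacher, not the authors' implementation: it keeps a history of (task parameters, absolute learning progress) tuples, periodically fits Gaussian mixtures of varying sizes (here selected by AIC, using scikit-learn's GaussianMixture), and biases task sampling toward the component with the highest mean ALP. All hyperparameters (fit period, exploration probability, component range, history window) are illustrative values, not the paper's.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class ALPGMMTeacher:
    """Simplified sketch of an ALP-GMM-style teacher (hyperparameters illustrative)."""

    def __init__(self, mins, maxs, fit_every=50, explore_p=0.2, seed=0):
        self.mins, self.maxs = np.array(mins, float), np.array(maxs, float)
        self.rng = np.random.default_rng(seed)
        self.history = []   # rows: [task params..., alp]
        self.prev = []      # (task params, episodic return), for ALP estimation
        self.fit_every, self.explore_p = fit_every, explore_p
        self.gmm = None

    def sample_task(self):
        # Random exploration with probability explore_p, or before the first fit.
        if self.gmm is None or self.rng.random() < self.explore_p:
            return self.rng.uniform(self.mins, self.maxs)
        # Sample from the Gaussian whose mean ALP (last coordinate) is highest.
        k = int(np.argmax(self.gmm.means_[:, -1]))
        task = self.rng.multivariate_normal(
            self.gmm.means_[k, :-1], self.gmm.covariances_[k][:-1, :-1])
        return np.clip(task, self.mins, self.maxs)

    def update(self, task, ep_return):
        # ALP: absolute return difference w.r.t. the closest previously sampled task.
        if self.prev:
            P = np.array([p for p, _ in self.prev])
            j = int(np.argmin(np.linalg.norm(P - task, axis=1)))
            alp = abs(ep_return - self.prev[j][1])
        else:
            alp = 0.0
        self.prev.append((np.array(task, float), ep_return))
        self.history.append(np.concatenate([task, [alp]]))
        if len(self.history) % self.fit_every == 0:
            X = np.array(self.history[-250:])
            # Fit GMMs with 2..5 components; keep the best by AIC.
            fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
                    for k in range(2, 6)]
            self.gmm = min(fits, key=lambda g: g.aic(X))
```

For Stump Tracks, the task space would be the 2-D (stump height, stump spacing) box, with `sample_task()` called before each procedural track generation and `update()` called with the episodic return afterwards.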

Figure 1: Multiple students and environments to benchmark teachers. (a): In addition to the default bipedal walker morphology (middle agent), we designed a bipedal walker with 50% shorter legs (left) and a bigger quadrupedal walker (right). (b): The quadrupedal walker is also used in Hexagon Tracks. (Parameterized BipedalWalker Environments with Procedural Generation)

Figure 2: Example of an ALP-GMM teacher paired with a Soft Actor-Critic student on Stump Tracks. Figures (a)-(c) show the evolution of ALP-GMM parameter sampling in a representative run. Each dot represents a sampled track distribution and is colored according to its Absolute Learning Progress value. After initial progress on the leftmost part of the space, as in (b), most ALP-GMM runs end up improving on track distributions with stump heights of 1 to 1.8, with the highest ones usually paired with spacing above 2.5 or below 1, indicating that tracks with large or very low spacing are easier than those with spacing in [1, 2.5]. Figure (d) shows, for the same run, which track distributions of the test set are mastered (i.e. r_t > 230, shown by green dots) after 17k episodes. (How do ALP-GMM, Covar-GMM and RIAC compare to reference teachers?)

Figure 3: Evolution of mastered track distributions for Teacher-Student approaches in Stump Tracks. The mean performance (32 seeded runs) is plotted with shaded areas representing the standard error of the mean. (How do ALP-GMM, Covar-GMM and RIAC compare to reference teachers?)

Figure 4: Teacher-Student approaches in Hexagon Tracks. Left: Evolution of mastered tracks for Teacher-Student approaches in Hexagon Tracks. 32 seeded runs (25 for Random) of 80 million steps were performed for each condition. The mean performance is plotted with shaded areas representing the standard error of the mean. Right: A visualization of which track distributions of the test set are mastered (i.e. r_t > 230, shown by green dots) by an ALP-GMM run after 80 million steps. (Are our approaches able to scale to ill-defined high-dimensional task spaces?)

Figure 5: Evolution of performance on n-dimensional toy-spaces. The impact of 3 aspects of the parameter space is tested: a growing number of meaningful dimensions (top row), a growing number of irrelevant dimensions (middle row), and an increasing number of hypercubes (bottom row). The median performance (percentage of unlocked hypercubes) is plotted, with shaded curves representing the performance of each run. 20 repeats were performed for each condition (for each toy-space). (Experiments on an n-dimensional toy parameter space)

Figure 12: Example of tracks generated in Stump Tracks. (Parameterized BipedalWalker Environments)

Figure 6: Schematic view of an ALP-GMM teacher's workflow. (Implementation details)

Figure 7: Evolution of Oracle parameter sampling for a default bipedal walker on Stump Tracks. Blue dots represent the last 300 sampled parameters; red dots represent all other previously sampled parameters. At first (a), Oracle starts by sampling parameters in the easiest subspace (i.e. large stump spacing and low stump height). After 2000 episodes (b), Oracle slid its sampling window towards stump tracks whose stump height lies between 0.6 and 1.1 and whose spacing lies between 3.7 and 4.7. After 10500 episodes (c), this Oracle run reached a challenging subspace that its student will not be able to master. By 15000 episodes, the sampling window had not moved, as the mean reward threshold was never crossed. (Additional visualizations for Stump Tracks experiments)

Figure 8: Evolution of RIAC parameter sampling for a default bipedal walker on Stump Tracks. At first (a), RIAC does not find any learning progress signal in the space, resulting in random splits. After 1500 episodes (b), RIAC focuses its sampling on the leftmost part of the space, corresponding to low stump heights, for which the SAC student manages to progress. After 15k episodes (c), RIAC spread its sampling to parameters corresponding to track distributions with stump heights up to 1.5, with the highest stumps paired with high spacing. By the end of training (d), the student has converged to a final skill level, and thus LP is no longer detected by RIAC, except for simple track distributions in the leftmost part of the space, in which occasional forgetting of walking gaits leads to an ALP signal. (Additional visualizations for Stump Tracks experiments)

Figure 9: Box plot of the final performance of each condition on Stump Tracks after 20M steps. Gold lines are medians, surrounded by a box showing the first and third quartiles, followed by whiskers extending to the last datapoint or 1.5 times the inter-quartile range. Beyond the whiskers are outlier datapoints. (a): For short agents, Random always ends up mastering 0% of the track distributions of the test set, except for a single run that is able to master 3 track distributions (6%). LP-based teachers obtained superior performance to Random while still failing to reach non-zero performance by the end of training in 13/32 runs for ALP-GMM, 15/32 for Covar-GMM and 19/32 for RIAC. (b): For default walkers, LP-based approaches have less variance than Oracle (visible in the difference in inter-quartile range), whose window-sliding strategy led to catastrophic forgetting in a majority of runs. Random remains the least performing algorithm. (c): For quadrupedal walkers, Oracle performs significantly worse than any other condition (p < 10^-5). Additional investigations on the data revealed that, by sliding its sampling window towards track distributions with higher stump heights and lower stump spacing, Oracle's runs mostly failed to master track distributions that were both hard and distant from its sampling window within the parameter space: that is, tracks with both high stump heights (> 2.5) and high spacing (> 3.0). (Additional visualizations for Stump Tracks experiments)

Figure 13: Generation of obstacles in Hexagon Tracks. Given a default hexagonal obstacle, the first 10 values of a 12-D parameter are used as positive offsets to the x and y positions of all vertices (except for the y positions of the first and last ones, in order to ensure the obstacle has at least one edge in contact with the ground). (Parameterized BipedalWalker Environments)

Figure 14: Example of tracks generated in Hexagon Tracks, with additional examples of encounterable obstacles. (Parameterized BipedalWalker Environments)

Figure 10: Evolution of mean performance of Teacher-Student approaches when increasing the amount of unfeasible tracks in Stump Tracks with default bipedal walkers. 32 seeded runs were performed for each condition. The mean performance is plotted with shaded areas representing the standard error of the mean. ALP-GMM is the most robust LP-based teacher and maintains a statistically significant performance advantage over all other conditions in all 3 settings. Random's performance is the most impacted when increasing the number of unfeasible tracks. ALP-GMM is more robust than RIAC when going from a maximal stump height of 3 to 4 and from 3 to 5. Note that, for comparison purposes, the same test set was used in all 3 experiments, containing only track distributions with a maximal stump height of 3. (Additional visualizations for Stump Tracks experiments)

Figure 11: Box plot of the final performance of each condition on Hexagon Tracks after 80M steps. Gold lines are medians, surrounded by a box showing the first and third quartiles, followed by whiskers extending to the last datapoint or 1.5 times the inter-quartile range. Beyond the whiskers are outlier datapoints. (Additional visualization for Hexagon Tracks experiments)
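The obstacle-generation rule of Figure 13 can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the base hexagon coordinates and the flattened (x0, y0, ..., x5, y5) ordering below are assumptions; only the rule itself (10 positive offsets from a 12-D parameter, skipping the y coordinates of the first and last vertices so one edge stays on the ground) comes from the caption.

```python
import numpy as np

# Illustrative default hexagon; the actual base shape used in the paper may differ.
# First and last vertices sit on the ground (y = 0).
DEFAULT_HEX = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 2.0],
    [2.0, 2.0],
    [3.0, 1.0],
    [3.0, 0.0],
])

def make_obstacle(params):
    """Offset the default hexagon using the first 10 values of a 12-D parameter.

    The y offsets of the first and last vertices are skipped so the obstacle
    keeps at least one edge in contact with the ground.
    """
    params = np.asarray(params, float)
    assert params.shape == (12,)
    offsets = np.zeros(12)
    # Flattened coordinate order: x0, y0, x1, y1, ..., x5, y5.
    # The 10 mutable coordinates are all of them except y0 and y5 (indices 1 and 11).
    mutable = [i for i in range(12) if i not in (1, 11)]
    offsets[mutable] = np.abs(params[:10])  # positive offsets only
    return DEFAULT_HEX + offsets.reshape(6, 2)
```

Sampling `params` uniformly (or via a teacher such as ALP-GMM) then yields a distribution over hexagonal obstacle shapes, with the remaining 2 parameter dimensions left to other aspects of track generation.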