[1912.02877v1] Training Agents using Upside-Down Reinforcement Learning
[This highlights the importance of evaluating with a sufficiently large number of random seeds, without which UDRL can appear on par with DQN.] We conjecture that this environment is well suited to TD learning by design, due to its dense reward structure and the large reward signals at the end of each episode.

Abstract. Traditional Reinforcement Learning (RL) algorithms either predict rewards with value functions or maximize them using policy search. We study an alternative: Upside-Down Reinforcement Learning (Upside-Down RL, or UDRL), which solves RL problems primarily using supervised learning techniques. Many of its main principles are outlined in a companion report [34]. Here we present the first concrete implementation of UDRL and demonstrate its feasibility on certain episodic learning problems. Experimental results show that its performance can be surprisingly competitive with, and even exceed, that of traditional baseline algorithms developed over decades of research.
Figure 2: A key distinction between the action-value function (Q) in traditional RL (e.g. Q-learning) and the behavior function (B) in UDRL is that the roles of actions and returns are switched. In addition, B may have other command inputs, such as desired states or the desired time horizon for achieving a desired return. (Knowledge Representation)

Figure 3: A toy environment with four discrete states.

Figure 4: A behavior function for the toy environment. (Knowledge Representation)

Figure 5: LunarLander-v2. Figure 6: TakeCover-v0.

Figure 7: Test environments. In LunarLander-v2, the agent does not observe the visual representation, but an 8-dimensional state vector instead. In TakeCover-v0, the agent observes down-sampled gray-scale visual input. (Environments)

Figure 8: On LunarLander-v2, UDRL is able to train agents that land the spacecraft, but is beaten by traditional RL algorithms.

Figure 9: On TakeCover-v0, UDRL consistently yields high-performing agents, outperforming DQN and A2C.

Figure 10: Evaluation results for LunarLander-v2 and TakeCover-v0. Solid lines represent the mean of evaluation scores over 20 runs using tuned hyperparameters and experiment seeds 1–20. Shaded regions represent 95% confidence intervals using 1000 bootstrap samples. Each evaluation score is the mean of 100 episode returns. (Environments)

Figure 11: Results for LunarLanderSparse, a sparse-reward version of LunarLander-v2 in which the cumulative reward is delayed until the end of each episode. UDRL learns much faster and more consistently than both DQN and A2C. Plot semantics are the same as ??. (Results)

Figure 16: Obtained vs. desired episode returns for UDRL agents at the end of training. Each evaluation consists of 100 episodes. Error bars indicate the standard deviation from the mean. (Sensitivity of Trained Agents to Desired Returns)
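To make the role reversal in Figure 2 concrete, the sketch below shows a behavior function B in the shape described by the paper: it takes the current state together with a command (desired return and desired time horizon) and outputs a distribution over actions. The linear model and all parameter names here are illustrative assumptions, not the paper's actual architecture; the input/output dimensions match LunarLander-v2 (8-dimensional state, 4 discrete actions).

```python
import numpy as np

rng = np.random.default_rng(0)

class BehaviorFunction:
    """Minimal sketch of a UDRL behavior function B(state, command) -> action.

    Hypothetical linear model for illustration only: the command
    (desired_return, desired_horizon) is concatenated with the state,
    mirroring the command inputs described for B in the paper.
    """

    def __init__(self, state_dim: int, n_actions: int):
        in_dim = state_dim + 2  # state + (desired_return, desired_horizon)
        self.w = rng.normal(scale=0.1, size=(in_dim, n_actions))

    def action_probs(self, state, desired_return, desired_horizon):
        # Concatenate observation and command into one input vector.
        x = np.concatenate([state, [desired_return, desired_horizon]])
        logits = x @ self.w
        # Numerically stable softmax over actions.
        e = np.exp(logits - logits.max())
        return e / e.sum()

# Example: query B with LunarLander-v2 shapes and a hypothetical command.
b = BehaviorFunction(state_dim=8, n_actions=4)
probs = b.action_probs(np.zeros(8), desired_return=200.0, desired_horizon=300)
```

In contrast to Q(state, action) -> expected return, the return here appears on the input side, which is what allows B to be trained by supervised learning on past episodes.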