[1910.01077v1] Task-Relevant Adversarial Imitation Learning
Our proposed method, TRAIL, effectively focuses the discriminator on the task even when task-irrelevant features are present, enabling the agent to solve challenging manipulation tasks where GAIL, BC, and DPGfD fail.
Abstract: We show that a critical problem in adversarial imitation from
high-dimensional sensory data is the tendency of discriminator networks to
distinguish agent and expert behaviour using task-irrelevant features beyond
the control of the agent. We analyze this problem in detail and propose a
solution as well as several baselines that outperform standard Generative
Adversarial Imitation Learning (GAIL). Our proposed solution, Task-Relevant
Adversarial Imitation Learning (TRAIL), uses a constrained optimization
objective to overcome task-irrelevant features. Comprehensive experiments show
that TRAIL can solve challenging manipulation tasks from pixels by imitating
human operators, where other methods, including behaviour cloning (BC), standard
GAIL, improved GAIL variants (among them our newly proposed baselines), and
Deterministic Policy Gradients from Demonstrations (DPGfD), fail to find
solutions even when they have access to the task reward.
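To make the constrained objective concrete, the following is a hypothetical sketch (not the authors' released code) of a TRAIL-style discriminator update in PyTorch. It assumes the constraint is relaxed into a penalty that drives the discriminator toward chance-level predictions on an invariant set I of observations (for example, early-episode frames) on which agent and expert should be indistinguishable by task performance; the names trail_discriminator_loss and penalty_weight are illustrative, not from the paper.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """GAIL-style discriminator: a positive logit means 'looks like expert'."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs):
        return self.net(obs)

def trail_discriminator_loss(disc, expert_obs, agent_obs,
                             invariant_obs, penalty_weight=10.0):
    # Standard GAIL discriminator loss: label expert 1, agent 0.
    bce = nn.BCEWithLogitsLoss()
    gail_loss = (bce(disc(expert_obs), torch.ones(len(expert_obs), 1))
                 + bce(disc(agent_obs), torch.zeros(len(agent_obs), 1)))
    # Penalty term: on invariant observations the discriminator should
    # output probability 0.5, i.e. logits near zero, so that task-irrelevant
    # features cannot be used to separate agent from expert.
    invariant_penalty = disc(invariant_obs).pow(2).mean()
    return gail_loss + penalty_weight * invariant_penalty

# Toy usage with random stand-in observations.
obs_dim = 64
disc = Discriminator(obs_dim)
opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
loss = trail_discriminator_loss(
    disc,
    expert_obs=torch.randn(32, obs_dim),
    agent_obs=torch.randn(32, obs_dim),
    invariant_obs=torch.randn(32, obs_dim),  # e.g. early-episode frames
)
opt.zero_grad()
loss.backward()
opt.step()

Here penalty_weight plays the role of a Lagrange-multiplier-style trade-off; the paper's exact constrained formulation and its construction of the invariant set may differ from this sketch.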
Figure 1: GAIL and TRAIL succeed at lifting (a), but when distractor objects are added, GAIL fails while TRAIL succeeds (b). Because it is robust to initial conditions, TRAIL can stack from pixels while standard GAIL fails (c). The same difference appears again for insertion with distractors (d). A video of agents performing these tasks is available at https://youtu.be/46rSpBY5p4E. (Introduction)

Figure 2: Illustration of several task-irrelevant changes between the expert demonstrations and the distribution of agent observations for the lift (red cube) task. A naively trained discriminator network will use these differences, rather than task performance, to distinguish agent and expert. (Introduction)

Figure 3: Results for lift alone, lift distracted, and lift distracted seeded. Only TRAIL excels. (Block lifting with distractors)

Figure 4: Lift red block, where the expert has a different body appearance and distractor blocks are present. TRAIL-random outperforms GAIL and performs on par with TRAIL-early. (Constructing the invariant set I)

Figure 5: Demonstrating the memorization problem on the lift distracted task (here higher accuracy is worse). Panels A-D show the accuracy of different discriminator heads: A, overall accuracy over all timesteps; B and C, accuracy of the main (m) and extra (e) heads on the first steps, respectively; D, accuracy of the head predicting a randomly assigned demonstration class. Panels E and F show average discriminator predictions for training and holdout demonstrations. (Measuring discriminator memorization of task-irrelevant features)

Figure 6: Results for lift, box, and stack in the Jaco environment. (Actor early stopping (TRAIL-0))

Figure 7: When the expert differs in body or prop appearance, TRAIL outperforms GAIL. (Learning from other embodiments and props)

Figure 8: Results comparing TRAIL, TRAIL-0, and GAIL on diverse manipulation tasks. (Evaluation on diverse manipulation tasks)

Figure 9: Two workspaces: Jaco (left), which uses the Jaco arm and is 20 × 20 cm, and Sawyer (right), which uses the Sawyer arm, more closely resembles a real robot cage, and is 35 × 35 cm. (Detailed description of environment)

Figure 10: Illustration of the pixel inputs to the agent. (Detailed description of environment)

Figure 11: TRAIL performance as the number of first frames per episode used to form I is varied. (Early frames ablation study)

Figure 12: Results for lift alone, lift distracted, and lift distracted seeded. (TRAIL-0 with random)

Figure 13: Results for stack in the Jaco workspace. A fixed-step termination policy can be very effective, but its final performance is very sensitive to the hyperparameter; TRAIL-0 needs neither tuning nor access to the environment reward. (Fixed termination policy)

Figure 14: Results for lift alone and lift distracted in the Sawyer workspace. (Data augmentation)

Figure 15: With fixed rewards, the agent learns lift alone to some extent but performs worse than TRAIL. When distractor blocks are added, the fixed-reward agent fails to learn entirely. (Comparing against learning with fixed rewards)

Figure 16: Network architecture for the policy and critic. (Network architecture and hyperparameters)