[1912.05032v1] Imitation Learning via Off-Policy Distribution Matching
We demonstrate the robustness of ValueDICE in a challenging synthetic tabular MDP environment, as well as on standard MuJoCo continuous control benchmark environments, and we show increased performance over baselines in both the low and high data regimes.

When performing imitation learning from expert demonstrations, distribution matching is a popular approach, in which one alternates between estimating distribution ratios and then using these ratios as rewards in a standard reinforcement learning (RL) algorithm. Traditionally, estimation of the distribution ratio requires on-policy data, which has caused previous work to either be exorbitantly data-inefficient or alter the original objective in a manner that can drastically change its optimum. In this work, we show how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective. In addition to the data-efficiency that this provides, we are able to show that this objective also renders the use of a separate RL optimization unnecessary. Rather, an imitation policy may be learned directly from this objective without the use of explicit rewards. We call the resulting algorithm ValueDICE and evaluate it on a suite of popular imitation learning benchmarks, finding that it can achieve state-of-the-art sample efficiency and performance. [Code to reproduce our results is available at https://github.com/google-research/google-research/tree/master/value_dice.]
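The "ratio as reward" step that the abstract contrasts against can be illustrated with a minimal sketch. A common way to estimate a distribution ratio from samples (as in GAIL-style methods) is to train a logistic discriminator between expert and policy samples: at convergence its logit approximates log(d^exp/d^pi). The sketch below is illustrative only, with toy 1-D Gaussian stand-ins for the two occupancy distributions; all names here are hypothetical and not from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for (1-D) expert and policy state-action samples.
expert = rng.normal(0.0, 1.0, size=5000)   # samples from d^exp
policy = rng.normal(1.0, 1.0, size=5000)   # samples from d^pi

# Logistic discriminator with features [1, x]; its logit estimates
# log(d^exp / d^pi), the quantity traditionally used as an RL reward.
X = np.stack([np.ones(10000), np.concatenate([expert, policy])], axis=1)
y = np.concatenate([np.ones(5000), np.zeros(5000)])  # label 1 = expert
w = np.zeros(2)

for _ in range(2000):                       # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-X @ w))        # discriminator probability
    w -= 0.5 * X.T @ (p - y) / len(y)       # logistic-loss gradient step

def log_ratio(x):
    """Approximate log d^exp(x) / d^pi(x) via the discriminator logit."""
    return w[0] + w[1] * x
```

For these two Gaussians the true log-ratio is 0.5 - x, so the learned estimate is positive where the expert is more likely (x near 0) and negative where the policy is (x near 2). ValueDICE's contribution is precisely to avoid this separate on-policy estimation loop.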

Figure 1: Results of ValueDICE on a simple Ring MDP. Left: The expert data is sparse and only covers states 0, 1, and 2. Nevertheless, ValueDICE learns a policy on all states to best match the observed expert state-action occupancies (the policy learns to always go to states 1 and 2). Right: The expert is stochastic. ValueDICE learns a policy which successfully minimizes the true KL computed between d^π and d^exp. (Experiments)
Figure 2: Comparison of algorithms given 1 expert trajectory. We use the original implementation of GAIL (Ho & Ermon, 2016) to produce the GAIL and BC results. (MuJoCo Benchmarks)
Figure 3: Comparison of algorithms given 10 expert trajectories. ValueDICE outperforms the other methods; however, given this amount of data, BC can recover the expert policy as well. (MuJoCo Benchmarks)
Figure 4: A tabular MDP with action probabilities. Left: deterministic expert demonstrations cover only 3 states. Right: the policy learned with ValueDICE is optimal for all states. (Matching Occupancy in Tabular MDPs)
Figure 5: A tabular MDP with action probabilities. Left: a stochastic expert visits all states starting from state 0. Right: ValueDICE matches the occupancy and recovers the stochastic policy. (Matching Occupancy in Tabular MDPs)
Figure 6: ValueDICE outperforms behavioral cloning given 1 trajectory, even without replay regularization. (Additional experiments)
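The quantities in Figures 1, 4, and 5 (state-action occupancies d^π and the KL between them) can be computed exactly in a small tabular MDP. Below is a minimal sketch for a hypothetical 5-state ring with two actions (move left / move right); the specific ring size, start distribution, and discount are assumptions, not the paper's exact setup.

```python
import numpy as np

n = 5        # ring of 5 states; action 0 = move left, 1 = move right
gamma = 0.95

def occupancy(policy):
    """Discounted state-action occupancy d^pi on the ring.

    policy: (n, 2) array of per-state action probabilities (assumed format).
    """
    # Policy-induced state transition matrix on the ring.
    P = np.zeros((n, n))
    for s in range(n):
        P[s, (s - 1) % n] += policy[s, 0]
        P[s, (s + 1) % n] += policy[s, 1]
    mu0 = np.full(n, 1.0 / n)  # uniform start distribution (assumption)
    # d_s = (1 - gamma) * sum_k gamma^k (P^T)^k mu0, via a linear solve.
    d_s = (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P.T, mu0)
    return d_s[:, None] * policy          # (n, 2) state-action occupancy

def kl(d_pi, d_exp, eps=1e-12):
    """KL(d_pi || d_exp), the objective ValueDICE minimizes."""
    return float(np.sum(d_pi * np.log((d_pi + eps) / (d_exp + eps))))
```

A sanity check on this construction: any policy's occupancy sums to 1, and the KL of an occupancy to itself is 0, which is the value the learned policy approaches in Figure 1 (right).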