[1909.12830v1] The Differentiable Cross-Entropy Method
Inspired by DCEM, other more powerful sampling-based optimizers could be made differentiable in the same way, potentially optimizers that leverage gradient-based information in the inner optimization steps (Sekhon & Mebane, 1998).

Abstract: We study the Cross-Entropy Method (CEM) for the non-convex optimization of a
continuous and parameterized objective function and introduce a differentiable
variant (DCEM) that enables us to differentiate the output of CEM with respect
to the objective function's parameters. In the machine learning setting this
brings CEM inside of the end-to-end learning pipeline where this has otherwise
been impossible. We show applications in a synthetic energy-based structured
prediction task and in non-convex continuous control. In the control setting we
show on the simulated cheetah and walker tasks that we can embed their optimal
action sequences with DCEM and then use policy optimization to fine-tune
components of the controller as a step towards combining model-based and
model-free RL.
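To ground the abstract, the vanilla CEM loop that DCEM differentiates through looks roughly as follows: sample from a Gaussian, keep the lowest-cost "elite" samples, and refit the Gaussian to the elites. This is a minimal sketch of the standard algorithm; the function name and hyperparameters are illustrative, not the paper's code.

```python
import numpy as np

def cem(f, mu, sigma, n_samples=100, n_elite=10, n_iters=10, seed=0):
    """Vanilla Cross-Entropy Method for minimizing f over R^d.

    Repeatedly samples from a diagonal Gaussian, keeps the elite
    (lowest-cost) samples, and refits the Gaussian to them. The hard
    top-k elite selection here is the non-differentiable step that
    DCEM replaces with a smooth relaxation.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    for _ in range(n_iters):
        X = rng.normal(mu, sigma, size=(n_samples, mu.size))
        costs = np.apply_along_axis(f, 1, X)
        elite = X[np.argsort(costs)[:n_elite]]
        mu, sigma = elite.mean(0), elite.std(0) + 1e-6
    return mu

# Minimize a simple non-convex objective in 2D.
x_star = cem(lambda x: np.sum(x**2) + np.sin(3 * x).sum(),
             mu=np.zeros(2), sigma=2 * np.ones(2))
```

The `argsort`-based elite selection is exactly where the gradient breaks in ordinary CEM, which motivates the relaxation introduced below.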

$\mathcal{L}_{n,k} = \left\{\, p \in \mathbb{R}^n \;\middle|\; 0 \le p \le 1 \ \text{and}\ \mathbf{1}^\top p = k \,\right\}$
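The polytope $\mathcal{L}_{n,k}$ is the set onto which DCEM projects sample scores to get a soft, differentiable version of top-$k$ elite selection (the LML-style projection). The entropy-regularized projection has the closed form $p_i = \sigma(x_i/\tau + \nu)$ with the dual variable $\nu$ chosen so the coordinates sum to $k$; a sketch with the dual found by bisection, assuming the entropy-regularized formulation (helper names are hypothetical, not the paper's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_topk(x, k, tau=1.0, n_bisect=60):
    """Entropy-regularized projection of scores x onto L_{n,k}.

    Solves max_p <p, x> + tau * H_b(p) s.t. 0 <= p <= 1, 1^T p = k.
    The optimum is p_i = sigmoid(x_i / tau + nu), with nu chosen by
    bisection so that sum(p) = k. As tau -> 0 this recovers hard
    top-k selection. (Sketch of the projection; not the paper's code.)
    """
    x = np.asarray(x, float) / tau
    lo, hi = -x.max() - 20.0, -x.min() + 20.0  # bracket for the dual nu
    for _ in range(n_bisect):
        nu = 0.5 * (lo + hi)
        if sigmoid(x + nu).sum() > k:
            hi = nu
        else:
            lo = nu
    return sigmoid(x + nu)

p = soft_topk(np.array([3.0, 1.0, 0.2, -1.0]), k=2, tau=0.5)
# p sums to 2 and concentrates mass on the two largest scores
```

Because every step above is smooth in `x`, gradients can flow from the soft elite weights back into the objective's parameters, which is what lets CEM sit inside an end-to-end learning pipeline.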

Figure 1: We trained an energy-based model with unrolled gradient descent and DCEM for a 1D regression task, with the target function shown in black. Each method unrolls through 10 optimizer steps. The contour surfaces show the (normalized, log-scaled) energy surfaces, highlighting that unrolled gradient descent models can overfit to the number of gradient steps. Lighter colors show areas of lower energy. (Unrolling optimizers for regression and structured prediction)

Figure 2: Visualization of the samples that CEM and DCEM generate to solve the cartpole task starting from the same initial system state. The plots starting at the top-left show that CEM initially has no temporal knowledge over the control space, whereas embedded DCEM's latent space generates a more feasible distribution over control sequences to consider in each iteration. Embedded DCEM uses an order of magnitude fewer samples and generates a better solution to the control problem. The contours on the bottom show the controller's cost surface C(z) for the initial state; lighter colors show regions with lower cost. (Unrolling optimizers for regression and structured prediction)

Figure 3: Our RSSM with action sequence embeddings. (Scaling up to the cheetah and walker)

Figure 4: We evaluated our final models by running 100 episodes each on the cheetah and walker tasks. CEM over the full action space uses 10,000 trajectories for control at each time step, while embedded DCEM samples only 1,000 trajectories. DCEM almost recovers the performance of CEM over the full action space, and PPO fine-tuning of the model-based components helps bridge the gap. (Scaling up to the cheetah and walker)

Figure 5: Left: convergence of DCEM and unrolled GD on the regression task. Right: ablation showing what happens when DCEM and unrolled GD are trained with 10 inner steps and a different number of steps is used at test time. We trained three seeds for each model; the shaded regions show the 95% confidence interval. (More details: Simple regression task)

Figure 6: Visualization of the predictions made by ablating the number of inner-loop iterations for unrolled GD and DCEM. The ground-truth regression target is shown in black. (More details: Simple regression task)

Figure 7: Impact of the activation function on the initial decoder values. (More details: Decoder initializations and activation functions)

Figure 8: Training and validation loss convergence for the cartpole task. The dashed horizontal line shows the loss induced by an expert controller. Larger latent spaces seem harder to learn, and as DCEM becomes less differentiable, the embedding is more difficult to learn. The shaded regions show the 95% confidence interval around three trials. (More details: Cartpole experiment)

Figure 9: Improvement factor on the ground-truth cartpole task from embedding the action space with DCEM compared to running CEM on the full action space, showing that DCEM is able to recover the full performance. We use the DCEM model that achieves the best validation loss. The error lines show the 95% confidence interval around three trials. (More details: Cartpole experiment)

Figure 10: Learned DCEM reward surfaces for the cartpole task. Each row shows a different initial state of the system. As the temperature decreases, the latent representation can still capture near-optimal values, but they lie in much narrower regions of the latent space than when τ = 1. (More details: Cartpole experiment)

Figure 11: Phase 1: the two base proprioceptive PlaNet training runs that use CEM over the full action space. The evaluation loss uses 10 episodes, and we show a rolling average of the training loss. (More details: Cheetah and walker experiments)

Figure 12: Phase 2: the training runs of learning an embedded DCEM controller with online updates. The evaluation loss uses 10 episodes, and we show a rolling average of the training loss. (More details: Cheetah and walker experiments)

Figure 13: Phase 3: the training run of PPO fine-tuning into the model-based components; we only use the PPO updates to tune these components and do not optimize for the likelihood in this phase. The evaluation loss uses 10 episodes. (More details: Cheetah and walker experiments)

Figure 14: Visualization of the DCEM iterates on the cheetah solving a single control problem from a random initial system state. The rows show iterates 1, 5, 7, and 10 from top to bottom. (More details: Cheetah and walker experiments)

Figure 15: Visualization of the DCEM iterates on the walker solving a single control problem from a random initial system state. The rows show iterates 1, 5, 7, and 10 from top to bottom. (More details: Cheetah and walker experiments)
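The unrolled-GD baseline in Figures 1, 5, and 6 predicts by running a few gradient steps on the energy and then backpropagating the regression loss through those steps. A toy 1D sketch of the idea, with the inner derivative tracked by hand so the example stays dependency-free (the energy and names here are illustrative, not the paper's model):

```python
def unrolled_gd(theta, x, y0=0.0, n_steps=10, lr=0.1):
    """Inner loop: minimize E(y) = (y - theta*x)^2 by gradient descent,
    tracking dy/dtheta through every step so an outer loss on the final
    iterate can be differentiated through the optimizer.

    y_{t+1} = y_t - lr * dE/dy = (1 - 2*lr) * y_t + 2*lr * theta * x,
    so     d y_{t+1}/d theta = (1 - 2*lr) * d y_t/d theta + 2*lr * x.
    """
    y, dy_dtheta = y0, 0.0
    for _ in range(n_steps):
        grad_y = 2.0 * (y - theta * x)                     # dE/dy
        y = y - lr * grad_y                                # inner GD step
        dy_dtheta = (1 - 2 * lr) * dy_dtheta + 2 * lr * x  # chain rule
    return y, dy_dtheta

theta, x, target = 0.0, 1.0, 2.0
y_hat, dy = unrolled_gd(theta, x)
outer_grad = 2.0 * (y_hat - target) * dy  # d/dtheta of (y_hat - target)^2
```

Because the prediction only reflects what 10 inner steps can reach, the learned energy can co-adapt to that step count, which is the overfitting behavior the ablations in Figures 5 and 6 probe.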