
[1612.00341] A Compositional Object-Based Approach to Learning Physical Dynamics
These assumptions are inductive biases that not only give the NPE enough structure to constrain it to model physical phenomena in terms of objects, but are also general enough for the NPE to learn physical dynamics almost exclusively from observation.
Abstract: We present the Neural Physics Engine (NPE), a framework for learning
simulators of intuitive physics that naturally generalize across variable
object count and different scene configurations. We propose a factorization of
a physical scene into composable object-based representations and a neural
network architecture whose compositional structure factorizes object dynamics
into pairwise interactions. Like a symbolic physics engine, the NPE is endowed
with generic notions of objects and their interactions; realized as a neural
network, it can be trained via stochastic gradient descent to adapt to specific
object properties and dynamics of different worlds. We evaluate the efficacy of
our approach on simple rigid body dynamics in two-dimensional worlds. By
comparing to less structured architectures, we show that the NPE's
compositional representation of the structure in physical interactions improves
its ability to predict movement, generalize across variable object count and
different scene configurations, and infer latent properties of objects such as
mass.
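The compositional structure described in the abstract can be made concrete with a minimal NumPy sketch: a shared pairwise encoder embeds each (focus, context) interaction, the embeddings are summed (making the prediction invariant to the order of context objects), and a decoder maps the sum plus the focus object's state to a velocity prediction. The dimensions, random weights, and ReLU nonlinearity below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 5   # e.g. position (2), velocity (2), mass (1) -- illustrative
HIDDEN = 16     # hypothetical hidden width

# Hypothetical weights standing in for trained networks.
W_enc = rng.standard_normal((2 * STATE_DIM, HIDDEN)) * 0.1
W_dec = rng.standard_normal((HIDDEN + STATE_DIM, 2)) * 0.1  # predicts 2-D velocity

def encode_pair(x_focus, x_context):
    """Pairwise encoder: embeds one (focus, context) interaction."""
    pair = np.concatenate([x_focus, x_context])
    return np.maximum(pair @ W_enc, 0.0)  # ReLU

def npe_predict(x_focus, contexts):
    """Sum the pairwise encodings, concatenate with the focus state, decode."""
    summed = sum((encode_pair(x_focus, x_c) for x_c in contexts),
                 np.zeros(HIDDEN))
    return np.concatenate([summed, x_focus]) @ W_dec

x3 = rng.standard_normal(STATE_DIM)           # focus object's state
neighbors = [rng.standard_normal(STATE_DIM)   # context objects in its neighborhood
             for _ in range(2)]
v3_next = npe_predict(x3, neighbors)
print(v3_next.shape)  # (2,)
```

Because the pairwise encodings enter only through a sum, the same weights handle any number of context objects, which is what lets this factorization generalize across variable object count.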
Figure 1: Physics Programs: We consider the space of physics programs over object-based representations under physical laws that are Markovian and translation-invariant. We consider each object in turn and predict its future state conditioned on the past states of itself and its context objects. (Neural Physics Engine)

Figure 2: Scenario and Models: This figure compares the NPE, NP, and LSTM architectures in predicting the velocity of object 3 for an example scenario [a] of two heavy balls (cyan) and two light balls (yellow-green). Objects 2 and 4 are in object 3's neighborhood, so object 1 is ignored. [b]: The NPE encoder consists of a pairwise layer (yellow) and a feedforward network (red), and its decoder (blue) is also a feedforward network. The input to the decoder is the concatenation of the summed pairwise encodings and the input state of object 3. [c]: The NP encoder is the same as the NPE encoder, but without the pairwise layer. The NP decoder is the same as the NPE decoder. The input to the decoder is the concatenation of the summed context encodings and the encoding of object 3. [d]: We shuffle the context objects inputted into the LSTM and use a binary flag to indicate whether an object is a context or focus object. (Baselines)

Figure 3: Quantitative evaluation (balls): [a,b]: Prediction and generalization tasks. Top two rows: the cosine similarity and the relative error in magnitude. Bottom row: the MSE of velocity on the test set over the course of training. Because these worlds are chaotic systems, it is not surprising that all predictions diverge from the ground truth over time, but the NPE consistently outperforms the other two baselines on all fronts, especially when testing on 6, 7, and 8 objects in the generalization task. The NPE's performance continues to improve with training while the NPE-NN (an NPE without a neighborhood mask, see Sec. ??), NP, and LSTM quickly plateau. We hypothesize that the NPE's structured factorization of the state space keeps it from wasting time exploring suboptimal programs. [c]: The NPE's accuracy in mass inference is significantly greater than the baseline models'. Notably, the NPE achieves similar inference performance in both the prediction and generalization settings, further showcasing its strong generalization capabilities. The LSTM performs worst, reaching just above random guessing (33% accuracy). [d]: We analyze the effectiveness of different neighborhood thresholds for the NPE on the constant-mass prediction task. Performance is quite robust to thresholds from 3 to 5 ball radii. (Experiments)

Figure 4: Visualizations: The NPE scales to complex dynamics and world configurations while the NP and LSTM cannot. The masses are visualized as: cyan = 25, red = 5, yellow-green = 1. [a] Consider the collision in the 7-balls world (circled). In the ground truth, the collision happens between balls 1 and 2, and the NPE correctly predicts this. The NP predicts a slower movement for ball 1, so ball 2 overlaps with ball 3. The LSTM predicts a slower movement and an incorrect angle off the world boundary, so ball 2 overlaps with ball 3. [b] At first glance, all models seem to handle collisions well in the “O” world (diamond), but when there are internal obstacles (cloud), only the NPE successfully resolves collisions. This suggests that the NPE's pairwise factorization handles object interactions well, letting it generalize to different world configurations, whereas the NP and LSTM have merely memorized the geometry of the “O” world. (Different Scene Configurations)

Figure 5: Quantitative evaluation (walls and obstacles): The compositional state representation reduces the physical prediction problem to one over local arrangements of context balls and obstacles, even when the wall geometries are more complex and varied on a macroscopic scale. It is therefore not surprising that the models perform consistently across wall geometries. Note that the NPE consistently outperforms the other models, and this gap in performance increases with more varied internal obstacles for the cosine similarity of the velocity angle. The gap is more prominent in the “L” and “U” geometries for relative error in magnitude. (Different Scene Configurations)

Figure 6: Error analysis on velocity and position: We summarize the error in velocity and position for each train-test variant of each experiment. Normalized velocity MSE is shown in the gray columns (multiplying these values by the maximum velocity of 60 gives the actual velocity error in pixels/timestep, where each timestep is about 0.1 seconds). The white columns show the Euclidean distance between the predicted and ground-truth positions of the ball, normalized by the ball radius (60 pixels), so multiplying these values by 60 gives the actual distance in pixels. The NPE consistently outperforms all baselines by 0.5 to 1 order of magnitude, which is also reflected in the bottom row of Fig. ??a,b. Notice that the variable-mass experiments exhibit only slightly higher error than their constant-mass variants, even though they contain masses that differ by a factor of 25. For the experiments with different scene configurations, we do not report error for the NPE-NN; the unnecessary computational complexity of operating on over 30 objects, and the degradation in performance without the mask evident from the other experiments, make the need for the neighborhood mask clear. (Quantitative Analysis)
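The neighborhood mask discussed in the captions above can be sketched as a simple distance cutoff around the focus object. The 3.5-ball-radii threshold, helper name, and example coordinates below are illustrative assumptions (the threshold merely falls inside the robust 3 to 5 range reported in Figure 3d):

```python
import numpy as np

BALL_RADIUS = 60.0                 # pixels, per the error analysis above
NEIGHBORHOOD = 3.5 * BALL_RADIUS   # hypothetical cutoff within the robust 3-5 range

def neighborhood_mask(focus_pos, context_positions, threshold=NEIGHBORHOOD):
    """Keep only the context objects within `threshold` of the focus object."""
    context_positions = np.asarray(context_positions, dtype=float)
    dists = np.linalg.norm(context_positions - focus_pos, axis=1)
    return dists <= threshold

# Made-up scene echoing Figure 2: objects 2 and 4 lie near focus object 3,
# while object 1 is far away and should be masked out.
focus = np.array([400.0, 300.0])
contexts = np.array([[1000.0, 900.0],   # object 1 (too far; ignored)
                     [450.0, 320.0],    # object 2
                     [380.0, 250.0]])   # object 4
print(neighborhood_mask(focus, contexts))  # [False  True  True]
```

Masking distant objects keeps the pairwise computation local, which is also why operating without the mask on scenes of over 30 objects adds unnecessary cost.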

