[1910.11106] Label-Conditioned Next-Frame Video Generation with Neural Flows
We show that providing Glow with a context representation of previous frames improves both video quality and held-out cross-entropy.

Abstract: Recent state-of-the-art video generation systems employ Generative
Adversarial Networks (GANs) or Variational Autoencoders (VAEs) to produce novel
videos. However, VAE models typically produce blurry outputs when faced with
sub-optimal conditioning of the input, and GANs are known to be unstable for
large output sizes. In addition, the output videos of these models are
difficult to evaluate, partly because the GAN loss function is not an accurate
measure of convergence. In this work, we propose using a state-of-the-art
neural flow generator called Glow to generate videos conditioned on a textual
label, one frame at a time. Neural flow models are more stable than standard
GANs, as they only optimize a single cross entropy loss function, which is
monotonic and avoids the circular convergence issues of the GAN minimax
objective. In addition, we show how to condition Glow on external context,
while still preserving the invertible nature of each "flow" layer. Finally, we
evaluate the proposed Glow model by calculating cross entropy on a held-out
validation set of videos, in order to compare multiple versions of the proposed
model via an ablation study. We show generated videos and discuss future
improvements.
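The abstract's key architectural claim is that Glow can be conditioned on external context (an encoding of previous frames) while every "flow" layer remains exactly invertible. A minimal sketch of how this works in an affine coupling layer: the scale and shift applied to one half of the input are predicted from the *other* half plus the context, so inversion only needs quantities that are available unchanged on the output side. The toy conditioner network, dimensions, and variable names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D, C = 4, 3                           # input dim (split in half) and context dim
W = rng.normal(size=(D // 2 + C, D))  # toy "network": a single linear map

def conditioner(x_a, context):
    """Predict log-scale and shift from the untouched half and the context."""
    h = np.concatenate([x_a, context]) @ W
    log_s, t = h[:D // 2], h[D // 2:]
    return np.tanh(log_s), t          # bound log-scale for numerical stability

def forward(x, context):
    x_a, x_b = x[:D // 2], x[D // 2:]
    log_s, t = conditioner(x_a, context)
    y_b = x_b * np.exp(log_s) + t
    log_det = log_s.sum()             # exact log-det term in the flow likelihood
    return np.concatenate([x_a, y_b]), log_det

def inverse(y, context):
    y_a, y_b = y[:D // 2], y[D // 2:]
    log_s, t = conditioner(y_a, context)  # y_a == x_a, so same predictions
    x_b = (y_b - t) * np.exp(-log_s)
    return np.concatenate([y_a, x_b])

x = rng.normal(size=D)
ctx = rng.normal(size=C)              # stand-in for an encoding of past frames
y, log_det = forward(x, ctx)
assert np.allclose(inverse(y, ctx), x)  # invertibility holds for any context
```

Because the context only feeds the conditioner and never gets mixed into the transformed coordinates, the Jacobian stays triangular and the change-of-variables log-likelihood (the "single monotonic cross-entropy loss" the abstract contrasts with the GAN minimax objective) can still be computed in closed form.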

Fig. 1: Proposed Glow video generation model. Converts dataset image samples into points in Gaussian space, then maximizes the probability of images in that space. During inference, a Gaussian point is sampled and converted back to a realistic image. In this work, we replace images with video frames and add context dependence on previous frames in the video.

Fig. 2: The Glow model is able to successfully generate human faces after training on the CelebA dataset.

Fig. 3: Gradient norm of label embeddings.

Fig. 4: Samples of the first frame of videos generated by the model (top), and real first frames from the dataset (bottom).

Fig. 5: Demonstration of the model training on the first four frames of a single video, showing that the model has the capacity to learn the dynamics of a single video. The boat moves slightly to the right over the video.

Fig. 6: Multiple GAN model runs reporting log determinant, log probability of training examples, and cross entropy loss (left), each in a different color. The orange run shows a spontaneous burst in loss value followed by NaN (not shown).

Fig. 7: Graphs visualizing GAN training. Instabilities in the model caused the mean and standard deviation (top right and bottom left, respectively) to converge to zero. In other training setups, these values saturated to the extremes of the tanh activation (-1 and +1).

Fig. 8: The proposed Glow generation model overfitting on a single video. While the first-frame generator is powerful enough to model the first frame, the tail generator struggles to maintain coherence throughout the remainder of the video.