[2001.01026v1] Painting Many Pasts: Synthesizing Time Lapse Videos of Paintings
Our results, including human evaluations, indicate that the proposed model is a powerful first tool for capturing stochastic changes from small video datasets.

Abstract

We introduce a new video synthesis task: synthesizing time lapse videos depicting how a given painting might have been created. Artists paint using unique combinations of brushes, strokes, colors, and layers. There are often many possible ways to create a given painting. Our goal is to learn to capture this rich range of possibilities. Creating distributions of long-term videos is a challenge for learning-based video synthesis methods. We present a probabilistic model that, given a single image of a completed painting, recurrently synthesizes steps of the painting process. We implement this model as a convolutional neural network, and introduce a training scheme to facilitate learning from a limited dataset of painting time lapses. We demonstrate that this model can be used to sample many time steps, enabling long-term stochastic video synthesis. We evaluate our method on digital and watercolor paintings collected from video websites, and show that human raters find our synthesized videos to be similar to time lapses produced by real artists.
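The recurrent synthesis described in the abstract (and spelled out in the model figure captions below, where the change at each step is δ̂t = gθ(zt, xt−1, xT) and x̂t = xt−1 + δ̂t) can be sketched as a simple sampling loop. This is a minimal illustration, not the paper's implementation: `sample_timelapse` and `toy_g_theta` are hypothetical names, and the toy decoder stands in for the learned CNN gθ.

```python
import numpy as np

def sample_timelapse(x_T, g_theta, num_steps, z_dim=8, rng=None):
    """Sample one painting trajectory conditioned on the finished painting x_T.

    At each step a latent z_t is drawn from a standard normal and the decoder
    predicts a change delta_t; the next frame is the previous frame plus that
    change (x_t = x_{t-1} + delta_t), clipped to valid pixel values.
    """
    rng = rng or np.random.default_rng(0)
    x_prev = np.zeros_like(x_T)              # start from a blank canvas
    frames = [x_prev]
    for _ in range(num_steps):
        z_t = rng.standard_normal(z_dim)     # z_t ~ N(0, I) at test time
        delta_t = g_theta(z_t, x_prev, x_T)  # predicted change for this step
        x_prev = np.clip(x_prev + delta_t, 0.0, 1.0)
        frames.append(x_prev)
    return frames

def toy_g_theta(z_t, x_prev, x_T):
    """Toy stand-in for the learned decoder g_theta: paint a random
    fraction of the remaining difference between canvas and target."""
    frac = abs(z_t[0]) * 0.1
    return frac * (x_T - x_prev)
```

Because zt is resampled at every step, repeated calls with different random seeds trace out different plausible painting trajectories toward the same finished image, which is the stochastic behavior the paper aims to capture.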
Figure captions:

Figure 1 (Introduction): We present a probabilistic model for synthesizing time lapse videos of paintings. We demonstrate our model on Still Life with a Watermelon and Pomegranates by Paul Cezanne (top), and Wheat Field with Cypresses by Vincent van Gogh (bottom).

Figure 2 (Problem overview): Several real painting progressions of similar-looking scenes. Each artist fills in the house, sky, and field in a different order.

Figure 3 (Problem overview): Example digital painting sequences. These sequences show a variety of ways to add paint, including fine strokes and filling (row 1), and broad strokes (row 3). We use red boxes to outline challenges, including erasing (row 2) and drastic changes in color and composition (row 3).

Figure 4 (Problem overview): Example watercolor painting sequences. The outlined areas highlight some watercolor-specific challenges, including changes in lighting (row 1), diffusion and fading effects as paint dries (row 2), and specular effects on wet paint (row 3).

Figure 5 (Model): We model the change δt at each time step as being generated from the latent variable zt. Circles represent random variables; the shaded circle denotes a variable that is observed at inference time. The rounded rectangle represents model parameters.

Figure 6 (Model): We implement our model using a conditional variational autoencoder framework. At training time, the network is encouraged to reconstruct the current frame xt, while sampling the latent zt from a distribution that is close to the standard normal. At test time, the auto-encoding branch is removed, and zt is sampled from the standard normal. We use the shorthand δ̂t = gθ(zt, xt−1, xT), x̂t = xt−1 + δ̂t.

Figure 7 (Model): In sequential CVAE training, our model is trained to reconstruct a training frame (outlined in green) while building upon its previous predictions for S time steps.

Figure 8 (Model): In sequential sampling training, we use a conditional frame critic to encourage all frames sampled from our model to look realistic. The image similarity loss on the final frame encourages the model to complete the painting in τ time steps.

Figure 9 (Implementation): Diversity of sampled videos. We show examples of our method applied to a digital (top 3 rows) and a watercolor (bottom 3 rows) painting from the test set. Our method captures diverse and plausible painting trajectories.

Figure 10 (Implementation): Videos predicted from the digital (top) and watercolor (bottom) test sets. For the stochastic methods vdp and ours, we show the nearest sample to the real video out of 2000 samples. We show additional results in the appendices.

Figure 11 (Qualitative results): Failure cases. We show unrealistic effects that are sometimes synthesized by our method, for a watercolor painting (top) and a digital painting (bottom).

Figure 12 (Qualitative results): Quantitative measures. As we draw more samples from each stochastic method (solid lines), the best video similarity to the real video improves. This indicates that some samples are close to the artist's specific painting choices. We use L1 distance as the metric on the left (lower is better), and stroke IOU on the right (higher is better). Shaded regions show standard deviations of the stochastic methods. We highlight several insights from these plots. (1) Both our method and vdp produce samples that are comparably similar to the real video by L1 distance (left). However, our method synthesizes strokes that are more similar in shape to those used by artists (right). (2) At low numbers of samples, the deterministic unet method is closer (by L1 distance) to the real video than samples from vdp or ours, since L1 favors blurry frames that average many possibilities. (3) Our method shows more improvement in L1 distance and stroke area IOU than vdp as we draw more samples, indicating that our method captures a more varied distribution of videos.

Figure 13 (Network architecture): We use an encoder-decoder style architecture for our model. For our critic, we use a similar architecture to StarGAN [10], and optimize the critic using WGAN-GP [19] with a gradient penalty weight of 10 and 5 critic training iterations for each iteration of our model. All strided convolutions and downsampling layers reduce the size of the input volume by a factor of 2.

Figure 14 (Additional results): Videos synthesized from the watercolor paintings test set. For the stochastic methods vdp and ours, we examine the nearest sample to the real video out of 2000 samples. We discuss the variability among samples from our method in Section ??, and in the supplementary video.

Figure 15 (Additional results): Videos synthesized from the watercolor paintings test set. For the stochastic methods vdp and ours, we show the nearest sample to the real video out of 2000 samples.
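The CVAE training setup described in the Figure 6 caption (reconstruct the current frame while keeping the latent posterior close to a standard normal) corresponds to the familiar reconstruction-plus-KL objective. The sketch below is an illustration under that reading, not the paper's actual loss: `cvae_loss` and its `kl_weight` balancing coefficient are hypothetical, and the KL term uses the standard closed form for a diagonal-Gaussian posterior against an N(0, I) prior.

```python
import numpy as np

def cvae_loss(x_t, x_hat_t, mu, log_var, kl_weight=1.0):
    """Per-step CVAE objective: L1 reconstruction of the current frame x_t
    plus KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    kl_weight is a hypothetical balancing coefficient, not a value taken
    from the paper.
    """
    recon = np.abs(x_t - x_hat_t).mean()
    # Closed-form KL between a diagonal Gaussian and the standard normal:
    # 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)
    return recon + kl_weight * kl
```

With a perfect reconstruction and a posterior equal to the prior (mu = 0, log_var = 0), both terms vanish; any mismatch in either term pushes the loss above zero, which is what drives the network to encode the frame-to-frame change in zt.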