[1912.03249v1] Gaussian Process Priors for View-Aware Inference
In the final example (??), we showed how the model can be of direct practical value by acting as a camera-motion-aware interpolator.

Abstract We derive a principled framework for encoding prior knowledge of information coupling between views or camera poses (translation and orientation) of a single scene. While deep neural networks have become the prominent solution to many tasks in computer vision, some important problems that are less well suited to deep models have received less attention. These include uncertainty quantification, auxiliary data fusion, and real-time processing, all of which are instrumental for delivering practical methods with robust inference. While these are central goals in probabilistic machine learning, there remains a tangible gap between the theory and the practice of applying probabilistic methods to many modern vision problems. To bridge this gap, we derive a novel parametric kernel (covariance function) on the pose space, SE(3), that encodes information about input pose relationships into larger models. We show how this soft prior knowledge can be applied to improve performance on several real vision tasks, such as feature tracking, human face encoding, and view synthesis.
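The abstract's central object is a covariance function on the pose space SE(3). A minimal sketch of the general idea follows: an RBF kernel on the Euclidean translation part multiplied by a sign-invariant kernel on unit quaternions. This decomposition, the function names, and the length-scale parameters are our own illustrative assumptions, not the paper's actual non-separable kernel.

```python
import numpy as np

def translation_kernel(t1, t2, lengthscale=1.0):
    # Standard RBF (squared-exponential) kernel on the translation part.
    d2 = np.sum((t1 - t2) ** 2)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def orientation_kernel(q1, q2, lengthscale=1.0):
    # Kernel on unit quaternions using the sign-invariant quaternion norm
    # distance d = min(||q1 - q2||, ||q1 + q2||); a hypothetical stand-in
    # for the paper's non-separable orientation kernel.
    d = min(np.linalg.norm(q1 - q2), np.linalg.norm(q1 + q2))
    return np.exp(-0.5 * d ** 2 / lengthscale ** 2)

def pose_kernel(pose1, pose2):
    # Product of the two kernels: a valid positive semi-definite
    # covariance on translation-plus-orientation pose tuples.
    (t1, q1), (t2, q2) = pose1, pose2
    return translation_kernel(t1, t2) * orientation_kernel(q1, q2)
```

The sign-invariance matters because the quaternions q and -q represent the same rotation, so the kernel must assign them maximal similarity.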
Figure 7. (Left) Covariance functions between two degrees-of-freedom rotations (for simpler visualization), with unit scale and ℓ = 1: (a) uses the geodesic distance, (b) the quaternion norm distance, (d) shows the separable periodic covariance function, and (e) the proposed non-separable covariance function. (Right) Cross-sections along the diagonal and along θ₂ ≡ 0, showing that in 1D (d) and (e) coincide, while (e) is symmetric in 2D/3D. [Panel titles: Geodesic; Quaternion; Covariance function (diagonal); Separable; Non-separable; Covariance function (when θ₂ ≡ 0).] (Sections: Camera pose priors; Application experiments)
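The two rotation distances compared in the caption above can be sketched directly on unit quaternions. This is a minimal illustration only; the helper `quat_from_axis_angle` and the quaternion convention (scalar-first) are our own assumptions, and the paper's exact definitions may differ.

```python
import numpy as np

def quat_from_axis_angle(axis, angle):
    # Scalar-first unit quaternion [w, x, y, z] for a rotation of
    # `angle` radians about `axis`.
    axis = np.asarray(axis, dtype=float)
    axis /= np.linalg.norm(axis)
    return np.concatenate(([np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis))

def geodesic_distance(q1, q2):
    # Rotation angle of the relative rotation; the absolute value of the
    # dot product makes the distance invariant to the quaternion sign.
    dot = np.clip(abs(np.dot(q1, q2)), -1.0, 1.0)
    return 2.0 * np.arccos(dot)

def quaternion_norm_distance(q1, q2):
    # Chordal distance on the quaternion sphere, minimized over the sign
    # ambiguity so that q and -q (the same rotation) have distance zero.
    return min(np.linalg.norm(q1 - q2), np.linalg.norm(q1 + q2))
```

For rotations about a common axis separated by an angle θ ≤ π, the geodesic distance equals θ while the quaternion norm distance equals 2 sin(θ/4), which is why the two curves differ away from zero.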

Figure 12. Results from experiments on the PoseNet chairs data set. (a) Visualization of the 18 camera view angles considered in the data. (b–c) View and object covariances resulting from the GPPVAE jointly learning the object features and hyperparameters. (d) Comparison of out-of-sample predictions of chairs in out-of-sample views, where the proposed view prior delivers sharper predictive samples. [Panel titles: View poses; View cov; Object cov; Test-set chairs with varying identity and three-dimensional view pose.] (Section: Comparison of camera motion kernels)

Figure 13. Two examples of view-aware latent space interpolation using the latent space of StyleGAN [24]. These reconstructions are based on taking only the first and the last frame of short side-to-side video sequences (Input #1 and #2), encoding them into the GAN latent space, and interpolating the intermediate frames using only the information of the associated camera poses (from Apple ARKit) captured on an iPhone XS. The intermediate frames were recovered by regressing the latent space with our view-aware GP prior. The frames are re-created in the correct head orientations. The irregular angular speed of the camera movement (not shown) is precisely captured by our method, resulting in non-symmetric interpolation. See the supplementary material for video examples. (Section: View synthesis with a Gaussian process prior variational autoencoder)

Figure 14. Row #1: Frames separated by equal time intervals from a camera run, aligned on the face. Row #2: Each frame independently projected to the GAN latent space and reconstructed. Row #3: Frames produced by reconstructing the first and the last frame and linearly interpolating the intermediate frames in the GAN latent space. Row #4: Frames produced by reconstructing the first and the last frame, but interpolating the intermediate frames in the GAN latent space with our view-aware GP prior. Although linear interpolation achieves good quality, the azimuth rotation angle of the face is lost, as expected; with the view-aware prior, the rotation angle is better preserved. Row #5: The per-pixel uncertainty, visualized as the standard deviation of the prediction at the corresponding time step. Heavier shading indicates higher uncertainty around the mean trajectory. (Section: View synthesis with a Gaussian process prior variational autoencoder)

Figure 22. Pose covariance matrix for all 789 frames of the video in the tracking experiment in ??. (Section: Details on the feature tracking experiment)

Figure 21. (Left) Distance matrices between two degrees-of-freedom rotations (for simpler visualization), with scale 0 to π: (a) shows the geodesic distance, (b) the quaternion norm distance, (d) the separable periodic distance, and (e) the non-separable orientation distance. (Right) Distance evaluations along the diagonal and when θ₂ ≡ 0. [Panel titles: Geodesic; Quaternion; Distance (diagonal); Separable; Non-separable; Distance (when θ₂ ≡ 0).] (Section: Link between the standard periodic kernel and the 3D view kernel)

Figure 26. GP regression results for three tracks (out of 533) using the pose kernel in ??. The red points correspond to ground-truth trajectories, where '??' marks training points and '??' marks unseen points. The blue points are the predicted mean values, and the shaded patches denote the 95% quantiles. [Panel titles: GP regression of u coordinates; GP regression of v coordinates; Pose covariance.] (Section: Details on the feature tracking experiment)

Figure 27. Two more examples of view-aware latent space interpolation using the latent space of StyleGAN [24]. These reconstructions are based on taking only the first and the last frame of short side-to-side video sequences (Input #1 and #2), encoding them into the GAN latent space, and interpolating the intermediate frames using only the information of the associated camera poses (from Apple ARKit) captured on an iPhone XS. The intermediate frames were recovered by regressing the latent space with our view-aware GP prior. The frames are re-created in the correct head orientations. The irregular angular speed of the camera movement (not shown) is precisely captured by our method, resulting in non-symmetric interpolation. (Section: Details on the face reconstruction experiment)
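The view-aware interpolation described in these captions amounts to GP regression over latent codes, conditioned on camera pose. The sketch below is a simplified, hypothetical version: it uses a scalar camera angle with an RBF kernel in place of the paper's SE(3) pose kernel, and computes the GP posterior mean independently for each latent dimension; the function names and the length-scale are ours.

```python
import numpy as np

def rbf(a, b, ell=0.3):
    # Stand-in for the view-aware pose kernel: an RBF on a scalar camera
    # angle. The paper conditions on full SE(3) poses instead.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_interpolate_latents(train_angles, train_latents, query_angles, noise=1e-6):
    # GP posterior mean K* (K + noise*I)^{-1} Z, applied column-wise to the
    # latent matrix Z (one row per observed frame, one column per latent dim).
    K = rbf(train_angles, train_angles) + noise * np.eye(len(train_angles))
    Ks = rbf(query_angles, train_angles)
    return Ks @ np.linalg.solve(K, train_latents)
```

In the experiments above, only the first and last frames supply observed latents, while the ARKit camera poses of the intermediate frames play the role of the query inputs; irregular angular speed then falls out naturally, since the kernel depends on pose rather than on frame index.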