[1712.06316] LSTM Pose Machines
We did observe some erroneous predictions when a joint is not visible for a long time, but we still found that the LSTM module contributed to better utilization of temporal information and made stable, accurate predictions across the video.
Abstract: We observed that recent state-of-the-art results on single image human pose
estimation were achieved by multi-stage Convolutional Neural Networks (CNNs).
Notwithstanding their superior performance on static images, applying these
models to videos is not only computationally intensive, it also suffers from
performance degradation and flickering. Such suboptimal results are mainly
attributed to the inability to impose sequential geometric consistency, to
handle severe image quality degradation (e.g., motion blur and occlusion), and
to capture the temporal correlation among video frames. In this paper, we
proposed a novel recurrent network to tackle these problems. We showed that if
the weights of a multi-stage CNN are shared across stages, the model can be
rewritten as a Recurrent Neural Network (RNN). This property decouples the
stages from one another and makes invoking the network on videos significantly
faster. It also enables the adoption of Long Short-Term Memory (LSTM) units
between video frames. We found that such a memory-augmented RNN is very
effective in imposing geometric consistency among frames. It also handles
input quality degradation in videos well while stabilizing the sequential
outputs. The experiments showed that our approach significantly outperformed
current state-of-the-art methods on two large-scale video pose estimation
benchmarks. We also explored the memory cells inside the LSTM and provided
insights on why such a mechanism benefits prediction in video-based pose
estimation.
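The reformulation described in the abstract is the heart of the model: once every stage of a multi-stage pose machine shares a single set of weights, the stage index behaves like a time index, so the network can consume one new video frame per stage instead of re-running the full cascade on every frame. Below is a minimal PyTorch sketch of that equivalence; the module names, depths, and channel sizes are hypothetical stand-ins (the paper's actual ConvNets follow the CPM design).

import torch
import torch.nn as nn

class SharedStage(nn.Module):
    """One stage of the pose machine, reused (weight-shared) at every step."""
    def __init__(self, feat_ch=32, n_joints=15):
        super().__init__()
        # Plays the role of a feature extractor (the real design is deeper).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        # Fuses current features with the previous stage's belief maps.
        self.head = nn.Conv2d(feat_ch + n_joints, n_joints, 3, padding=1)

    def forward(self, frame, prev_belief):
        feats = self.encoder(frame)
        return self.head(torch.cat([feats, prev_belief], dim=1))

class RecurrentPoseNet(nn.Module):
    """Unrolls the single shared stage across the frames of a clip:
    stage index == time index, so each stage sees a new frame."""
    def __init__(self, n_joints=15):
        super().__init__()
        self.stage = SharedStage(n_joints=n_joints)
        self.n_joints = n_joints

    def forward(self, clip):  # clip: (T, B, 3, H, W)
        T, B, _, H, W = clip.shape
        belief = torch.zeros(B, self.n_joints, H, W, device=clip.device)
        beliefs = []
        for t in range(T):  # the recurrence replaces the stage cascade
            belief = self.stage(clip[t], belief)
            beliefs.append(belief)
        return torch.stack(beliefs)  # per-frame joint heatmaps

Because the same SharedStage is applied at every step, processing a T-frame clip costs one stage per frame rather than the whole cascade per frame, which is the speed-up the abstract refers to.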
Figure 1. Comparison of results produced by the Convolutional Pose Machine (CPM) applied to a video as a series of static images (top) and by our method (bottom). Several problems occur during pose estimation on videos: a) errors, and our correct results, in estimating symmetric joints; b) errors, and our correct results, when joints are occluded; c) flickering results, and our results, when the body moves rapidly. (Introduction)

Figure 2. Network architecture of LSTM Pose Machines. The network consists of T stages, where T is the number of frames. In each stage, one frame from the sequence is fed into the network as input. ConvNet2 is a multi-layer CNN for extracting features, while an additional ConvNet1 is used in the first stage for initialization. Results from the last stage are concatenated with the newly processed input plus a central Gaussian map, and the result is sent into the LSTM module. Outputs from the LSTM pass through ConvNet3, which produces predictions for each frame. The architectures of these ConvNets are the same as their counterparts in the CPM model, but their weights are shared across stages. The LSTM also shares weights, which reduces the number of parameters in our network. (Analysis and Our Approach)

Figure 3. Qualitative results of pose estimation on the Penn and sub-JHMDB datasets using our LSTM Pose Machines. (Implementation Details)

Figure 4. Attention from different memory channels. The first three focus on trunks or edges, while the other three focus on a particular joint. (Evaluation on Pose Estimation Results)

Figure 5. Exploration of the LSTM's memory: a) memory from the last stage (i.e., $C_{t-1}$) on the last frame $X_{t-1}$; b) memory from the last stage (i.e., $C_{t-1}$) on the new frame $X_t$; c) memory after the forget operation (i.e., $f_t \odot C_{t-1}$) on the new frame $X_t$; d) newly selected input (i.e., $i_t \odot g_t$) on the new frame $X_t$; e) newly formed memory (i.e., $C_t$) on the new frame $X_t$, which is the element-wise sum of c) and d); and f) the predicted results on the new frame $X_t$. For each sample we pick three consecutive frames. (Exploring and Visualizing LSTM)
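Reading the Figure 2 and Figure 5 captions together gives the per-frame computation: concatenate the new frame's features with the previous stage's heatmaps and a central Gaussian map, feed the result through a convolutional LSTM whose memory update is $C_t = f_t \odot C_{t-1} + i_t \odot g_t$, then decode the LSTM output with ConvNet3. Below is a sketch of that step assuming a standard ConvLSTM cell; the paper's exact gate layout may differ, and all names here are illustrative.

import torch
import torch.nn as nn

class ConvLSTMStep(nn.Module):
    """A standard convolutional LSTM cell (an assumption; the paper's
    gate layout may differ)."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        # One conv yields all four gate pre-activations from [x, h_{t-1}].
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, 3, padding=1)

    def forward(self, x, h_prev, c_prev):
        i, f, o, g = torch.chunk(
            self.gates(torch.cat([x, h_prev], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c_prev + i * g   # c) memory after forget + d) newly selected input
        h = o * torch.tanh(c)    # hidden state passed on to the output ConvNet
        return h, c

def frame_step(features, prev_belief, center_map, lstm, convnet3, h, c):
    """One stage of Figure 2: `features` is ConvNet2's output for the new
    frame, `prev_belief` holds the last stage's heatmaps, and `center_map`
    is the central Gaussian map."""
    x = torch.cat([features, prev_belief, center_map], dim=1)
    h, c = lstm(x, h, c)
    belief = convnet3(h)         # per-joint heatmaps for this frame
    return belief, h, c

The decomposition in Figure 5 maps directly onto this cell: $f_t \odot C_{t-1}$ is the memory after the forget operation, $i_t \odot g_t$ is the newly selected input, and their element-wise sum is the newly formed memory $C_t$.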
Related:
[1707.09695] Recurrent 3D Pose Sequence Machines
[1703.10898] Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos
[1605.02346] Chained Predictions Using Convolutional Neural Networks
[1802.09232] 2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning
[1711.08585] Exploiting temporal information for 3D pose estimation
[1703.10106] Pose-conditioned Spatio-Temporal Attention for Human Action Recognition
[1609.00036] Human Pose Estimation in Space and Time using 3D CNN
[1602.00134] Convolutional Pose Machines
[1607.06416] Hierarchical Attention Network for Action Recognition in Videos
[1503.08909] Beyond Short Snippets: Deep Networks for Video Classification