[1910.01275v1] A Neural Network for Detailed Human Depth Estimation from a Single Image
This paper proposes a neural network to estimate a detailed depth map for the human body in a single input RGB image.
Abstract: This paper presents a neural network to estimate a detailed depth map of the
foreground human in a single RGB image. The result captures geometry details
such as cloth wrinkles, which are important in visualization applications. To
achieve this goal, we separate the depth map into a smooth base shape and a
residual detail shape and design a network with two branches to regress them
respectively. We design a training strategy to ensure both base and detail
shapes can be faithfully learned by the corresponding network branches.
Furthermore, we introduce a novel network layer to fuse a rough depth map and
surface normals to further improve the final result. Quantitative comparison
with fused 'ground truth' captured by real depth cameras and qualitative
examples on unconstrained Internet images demonstrate the strength of the
proposed method.
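As a rough illustration of the base/detail separation described above, the following PyTorch sketch shows a shared encoder feeding two branches whose outputs are summed into the composed depth map. The module layout, layer sizes, and input channels (RGB plus skeleton and segmentation cues) are illustrative assumptions, not the paper's exact Depth-Net architecture.

```python
import torch
import torch.nn as nn

class TwoBranchDepthNet(nn.Module):
    """Illustrative sketch: a shared encoder followed by two branches that
    regress a smooth base depth and a residual detail depth, which are summed
    into the composed depth map. Layer sizes are placeholder assumptions."""

    def __init__(self, in_ch=3 + 1 + 1):  # e.g. RGB + skeleton cue + segmentation cue
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Base branch: low-frequency, smooth body shape.
        self.base_branch = nn.Conv2d(64, 1, 3, padding=1)
        # Detail branch: high-frequency residual such as cloth wrinkles.
        self.detail_branch = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, x):
        feat = self.encoder(x)
        base = self.base_branch(feat)
        detail = self.detail_branch(feat)
        # Composed depth is the sum of the two parts.
        return base + detail, base, detail


# Example usage with a dummy 5-channel input of size 256x256.
net = TwoBranchDepthNet()
composed, base, detail = net(torch.randn(1, 5, 256, 256))
print(composed.shape)  # torch.Size([1, 1, 256, 256])
```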
Figure 1. The structure of our proposed network. The Skeleton-Net and Segmentation-Net generate heatmaps of the 3D skeleton joints and the body part segmentation respectively. Their results are fused with the input image to compute the base shape and the detail shape via the Depth-Net. In a separate branch, the Normal-Net estimates a surface normal map. The composed shape and the normal map are then fused in the depth refinement module to produce the final result. (Overview)
Figure 2. Architecture of the Depth-Net with its base shape and detail shape branches, the Normal-Net, and the depth refinement module. The branches in the blue and red dashed rectangles correspond to the detail and base shape branches respectively. (Depth Estimation Network)
Figure 3. Comparison of our depth refinement with [30] on a toy sine-curve example. Left: the ground truth and the results from our method and from [30] (top to bottom). Right: sectional views of these results. (Depth Refinement)
Figure 4. Comparison of our depth refinement with the 'Kernel Regression' of [30] on real data. From left to right: the ground truth shape, the result of our method, and the result of 'Kernel Regression'. (Depth Refinement)
Figure 5. Some results on the testing data. From left to right: the single input RGB image, the ground truth shape, and our result. Our method recovers the main layout as well as certain geometry details. Note that our network is trained on the noisy raw depth images captured by the Kinect2 camera, yet it still produces polished results. (Depth Refinement)
Figure 6. Cumulative distribution function of the depth error for our method and the comparison methods [36, 35, 16]. (Experiment)
Figure 7. Qualitative comparisons. The first row shows heatmaps of the depth errors; the second row shows the recovered meshes. Columns, left to right: A. ground truth, B. ours (final shape), C. ours (off-the-shelf), D. SURREAL [36], E. BodyNet [35], F. Laina et al. [16], and G. Kovesi et al. [14]. (Quantitative Results)
Figure 8. Comparison of our proposed method with 'W/o Skeleton and Segmentation Cues'. From left to right: the image, the ground truth, our result, and the result of the setting without the Segmentation-Net and Skeleton-Net cues. Without high-level information to guide the depth estimation, the result can have large shape errors. (Ablation Studies)
Figure 9. Comparison of our proposed method with 'No Depth Separation'. From left to right: the image, the ground truth, our result, and the result of the setting with only one depth branch. Without the two-branch architecture, the results are rough and lack geometry details. (Ablation Studies)
Figure 10. Comparison of our proposed method with 'Only Stage 1 Training'. From left to right: the image, the ground truth, our result, and the result of the setting trained with only stage 1. Without stage 2, the results show wrong wrinkles on the clothes. (Ablation Studies)
Figure 11. Comparison of our proposed method with 'Huber Loss on Composed Shape'. From left to right: the image, the ground truth, our result, and the result of the setting with the Huber loss on the composed depth. The results that do not use our truncated L1 loss (sketched after this caption list) are unstable and not smooth enough. (Ablation Studies)
Figure 12. Comparison of our proposed method with 'Only Stage 2 Training'. From left to right: the image, the ground truth, our result, and the result of the setting trained with only stage 2. Zooming in shows that the setting without stage 1 loses most of the geometry details. (Ablation Studies)
Figure 13. Some results on unconstrained online images. From left to right, each example shows the input image, the estimated surface normal, and the final result. (Qualitative Results)
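Figure 11 contrasts the method's truncated L1 loss on the composed depth with a Huber loss. The exact truncation threshold is not given on this page, so the sketch below simply clips the per-pixel absolute error at a hypothetical value tau before averaging, which is one plausible form of such a loss rather than the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def truncated_l1_loss(pred, target, tau=0.1):
    """Hedged sketch of a truncated L1 loss: per-pixel absolute errors are
    clipped at tau so large outliers (e.g. noisy raw depth pixels) do not
    dominate the gradient. tau is an assumed placeholder value."""
    return torch.clamp((pred - target).abs(), max=tau).mean()


# Example: compare against the standard Huber (smooth L1) loss.
pred = torch.randn(1, 1, 64, 64)
target = torch.randn(1, 1, 64, 64)
print(truncated_l1_loss(pred, target))
print(F.smooth_l1_loss(pred, target))
```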