Deep Network for the Integrated 3D Sensing of Multiple People in Natural Images

2018 
We present a feed-forward, multitask, end-to-end trainable system for the integrated 2d localization, as well as 3d pose and shape estimation, of multiple people in monocular images. The challenge is the formal modeling of the problem that intrinsically requires discrete and continuous computation (e.g. grouping people vs. predicting 3d pose). The model identifies human body structures (joints and limbs) in images, groups them based on 2d and 3d information fused using learned scoring functions, and optimally aggregates such responses into partial or complete 3d human skeleton hypotheses under kinematic tree constraints, but without knowing in advance the number of people in the scene and their visibility relations. We design a single multi-task deep neural network with differentiable stages where the person grouping problem is formulated as an integer program based on learned body part scores parameterized by both 2d and 3d information. This avoids suboptimality resulting from separate 2d and 3d reasoning, with grouping performed based on the combined information. The calculation can be formally described as a linear binary integer program with globally optimal solution. The final predictive stage of 3d pose and shape is based on a learned attention process where information from different human body parts is optimally fused. State-of-the-art results are obtained in large scale datasets like Human3.6M and Panoptic.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    23
    References
    88
    Citations
    NaN
    KQI
    []