Learning To Exploit Stability For 3D Scene Parsing

Authors:
Yilun Du MIT
Zhijian Liu MIT
Hector Basevi University of Birmingham
Ales Leonardis University of Birmingham
Bill Freeman MIT/Google
Josh Tenenbaum MIT
Jiajun Wu MIT

Introduction:

Human scene understanding uses a variety of visual and non-visual cues to perform inference on object types, poses, and relations. The authors present a novel architecture for 3D scene parsing named Prim R-CNN, which learns to predict bounding boxes as well as their 3D size, translation, and rotation.

Abstract:

Human scene understanding uses a variety of visual and non-visual cues to perform inference on object types, poses, and relations. Physics is a rich and universal cue which we exploit to enhance scene understanding. We integrate the physical cue of stability into the learning process using a REINFORCE approach coupled to a physics engine, and apply this to the problem of producing the 3D bounding boxes and poses of objects in a scene. We first show that applying physics supervision to an existing scene understanding model increases performance, produces more stable predictions, and allows training to an equivalent performance level with fewer annotated training examples. We then present a novel architecture for 3D scene parsing named Prim R-CNN, learning to predict bounding boxes as well as their 3D size, translation, and rotation. With physics supervision, Prim R-CNN outperforms existing scene understanding approaches on this problem. Finally, we show that applying physics supervision on unlabeled real images improves real domain transfer of models trained on synthetic data.
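
To illustrate why REINFORCE suits a stability signal from a physics engine, here is a minimal toy sketch, not the authors' implementation. A physics engine only reports whether a predicted scene is stable, so the reward cannot be backpropagated through directly; REINFORCE instead samples predictions and weights the log-probability gradient by the reward. Everything here is an assumption made for illustration: the one-dimensional "scene" (a box centre `x` over a support spanning [-1, 1]), the `is_stable` function standing in for a physics engine, the Gaussian policy with learnable mean `mu`, and all hyperparameters.

```python
import random

SUPPORT_HALF_WIDTH = 1.0  # hypothetical: the support surface spans [-1, 1]


def is_stable(x):
    """Toy stand-in for a physics engine: the box is stable if its centre lies over the support."""
    return abs(x) <= SUPPORT_HALF_WIDTH


def train(mu=1.5, sigma=0.5, lr=0.05, steps=200, batch=256, seed=0):
    """Move the mean of a Gaussian over box positions toward stable placements using REINFORCE."""
    rng = random.Random(seed)
    for _ in range(steps):
        # Sample candidate positions from the current policy N(mu, sigma^2).
        xs = [rng.gauss(mu, sigma) for _ in range(batch)]
        # Non-differentiable reward: 1 if the toy check says stable, else 0.
        rewards = [1.0 if is_stable(x) else 0.0 for x in xs]
        # Batch-mean baseline to reduce the variance of the gradient estimate.
        baseline = sum(rewards) / batch
        # Score function: d/dmu log N(x; mu, sigma^2) = (x - mu) / sigma^2.
        grad = sum(
            (r - baseline) * (x - mu) / sigma ** 2 for r, x in zip(rewards, xs)
        ) / batch
        mu += lr * grad  # gradient ascent on expected reward
    return mu


if __name__ == "__main__":
    print("learned mean position:", train())
```

Starting from an unstable mean position (`mu=1.5`), the update moves the policy toward placements the stability check accepts, without ever differentiating through the check itself. In the paper's setting the sampled quantities would be 3D box parameters and the reward would come from a real physics simulation.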

You may want to know: