Pose-Guided Human Semantic Part Segmentation
Human semantic part segmentation and human pose estimation are two fundamental and complementary tasks in computer vision. Joint localization in pose estimation can be made much more accurate by enforcing consistency with part segments, while local confusions in part segmentation can be greatly reduced with top-down pose information. In natural scenes that contain multiple people, both tasks remain challenging because of multi-instance confusion and large variations in pose, scale, appearance, and occlusion. Current state-of-the-art methods for both tasks rely on deep neural networks to extract data-dependent features and combine them with a carefully designed graphical model. However, these methods have no efficient mechanism for handling overlapping people or for adapting to the scale of human instances, and are therefore still limited when facing large variability in human pose and scale.
To improve on current methods for both tasks, we propose three models that tackle pose and scale variation along two major directions: (1) introducing top-down pose consistency into semantic part segmentation and part segment consistency into human pose estimation, so that the two tasks benefit each other; and (2) handling scale variation with a mechanism that adapts to the size of human instances and their corresponding parts. Our first model incorporates pose cues into a graphical-model-based part segmentation framework, while our third model incorporates pose information into a framework built from fully convolutional networks (FCNs). Our second model is a hierarchical FCN framework that jointly performs object/part scale estimation and part segmentation, adapting to the size of objects and parts. We show that all three models achieve state-of-the-art performance on challenging datasets.