Reconstructing 3D objects from 2D images or videos is a fundamental yet challenging problem in computer vision, with applications in AR/VR, gaming, content creation, and autonomous driving. The goal of monocular 3D reconstruction is to estimate the 3D pose and shape of an object from a single-view image or video. Recently, supervised methods have achieved significant progress using various 3D representations such as voxel grids, deformable meshes, point clouds, and implicit functions. However, these methods depend heavily on ground-truth 3D shapes or multi-view images for training, obtained either from synthetic data or from human-labeled datasets. Considering the large domain gap between synthetic and natural images, as well as the difficulty of annotating large-scale 3D datasets, we aim to develop methods that can utilize weak supervisory signals such as 2D silhouettes, canonical surface mappings, and generic skeletons. Specifically, to produce robust and high-fidelity 3D shapes, we exploit geometric priors on 3D object parts and dense visibility, semantic consistency between images, and generative priors from a Stable Diffusion model. In this thesis, we tackle the problem of monocular 3D reconstruction for three diverse categories: 1) general rigid objects, 2) human bodies, and 3) articulated shapes.
First, we design a reconstruction network to predict the 3D shape of general rigid objects from single-view images. To alleviate the 3D ambiguity of 2D appearance, we propose a part-based representation with multiple meshes and regularize the part shapes with geometric primitives. We demonstrate that the network can automatically discover useful 3D parts while learning to reconstruct a whole object. In turn, the discovered parts fit the object shape faithfully and help improve the overall reconstruction accuracy. Moreover, because the 3D parts are consistent across instances of the same category, they enable applications such as shape interpolation and generation.
Second, we learn dense human body estimation that is robust to partial observations. While prior methods with model-based representations can perform reasonably well on whole-body images, they often fail when parts of the body are occluded or outside the frame. Instead, we adopt a heatmap-based representation and explicitly model the visibility of human joints and vertices. Visibility along the x and y axes helps distinguish out-of-frame cases, and visibility along the depth axis corresponds to occlusions (either self-occlusions or occlusions by other objects). We show that visibility can serve as 1) an additional signal to resolve depth-ordering ambiguities of self-occluded vertices and 2) a regularization term when fitting a human body model to the predictions.
Finally, we propose a novel and practical problem setting: estimating the 3D pose and shape of articulated animal bodies given only a few (10-30) in-the-wild images of a particular animal class. Contrary to existing works that rely on pre-defined template shapes, we do not assume any form of 2D or 3D ground-truth annotations, nor do we leverage any multi-view or temporal information. Our key insights are that 3D parts have much simpler shapes than the overall animal and that they are robust to animal pose articulations. Following these insights, we propose three novel optimization frameworks (LASSIE, Hi-LASSIE, and ARTIC3D) that discover 3D skeletons and parts in a self-supervised manner by combining geometric, semantic, and generative priors.