Humans interpret the visual world with very rich understanding. When observing the world through our eyes, we not only grasp the high-level semantic meaning of each region or pixel; more importantly, we also recover 3D properties, such as how far away each object is and what 3D shape it has, in order to interact with the world. In the field of computer vision, however, visual understanding is separated into multiple tasks, e.g., segmentation, 3D reconstruction, and object detection, due to its high complexity. This separation means that the results of different strategies often lack compatibility with one another. For example, semantic object detection does not account for 3D occluded regions, while 3D reconstruction ignores the overall semantic context. Thus, to achieve good visual understanding, it is critical to address the different tasks jointly while maintaining their compatibility.
Fortunately, thanks to the rise of deep learning, in particular the convolutional neural network (CNN), which dramatically outperforms traditional strategies on many visual tasks using hierarchically learned features within a nearly unified framework, we are able to combine different forms of understanding in a more compact and efficient way by designing appropriate outputs and interaction terms.
However, a CNN is not a magic key that solves all problems. One obvious limitation is that its convolutional kernel sizes and layer depths are chosen arbitrarily, yielding fixed receptive fields that cannot adapt to the variation in object scales. In addition, it is not straightforward to add arbitrary connections within each layer based on intuition. We therefore embed a conditional random field (CRF) into the system to compensate for these deficiencies, unifying different cues and performing multiple tasks simultaneously.
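Concretely, such a CRF couples the per-pixel CNN predictions through an energy function. As a minimal sketch of the standard formulation (the exact potentials used in later chapters may differ), for a labeling \(\mathbf{x} = (x_1, \ldots, x_N)\) over pixels we minimize

\[
E(\mathbf{x}) = \sum_{i} \psi_u(x_i) + \sum_{(i,j) \in \mathcal{E}} \psi_p(x_i, x_j),
\]

where the unary potential \(\psi_u(x_i)\) is derived from the CNN output at pixel \(i\) (e.g., its negative log-likelihood), the pairwise potential \(\psi_p(x_i, x_j)\) encodes compatibility constraints between connected predictions in an edge set \(\mathcal{E}\), and inference (e.g., via a mean-field approximation) seeks the labeling that minimizes \(E\). The pairwise term is what lets us inject connections, and cross-task consistency, that the CNN architecture alone does not provide.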
In this thesis, we prove this concept by estimating multiple tasks jointly, including joint part and object segmentation, and joint segmentation and geometry estimation. We first show that deep convolutional networks can be adapted to many different tasks, achieving superior performance compared to traditional shallow features. Second, by unifying different tasks with our designed compatibility constraints, we make the tasks mutually regularizing and mutually beneficial. Third, to evaluate the results, we conduct experiments on standard benchmarks such as PASCAL for segmentation and the NYU v2 dataset for depth estimation. Last but not least, we not only apply existing metrics to demonstrate the performance gains from our design, but also introduce new metrics that better reveal the aspects that improved.