Deep representation learning has come to dominate almost every task in computer vision, achieving superior performance. In this paradigm, deep neural networks are trained on massive data to provide rich visual representations. However, general-purpose neural networks are not fully aware of visual structures, which limits their generalizability on specific vision tasks (e.g., skeleton detection and line segment detection) and under particular scenarios (e.g., few-shot settings). To address this limitation, we find it natural and essential to enhance deep representations with visual structures.
This dissertation concentrates on three visual structures: geometric structure, part structure, and multi-scale structure. We present four examples that study these visual structures in deep representation learning. First, we focus on object skeleton detection and introduce geometric structure into the design of the objective function. Second, we encode part structure in a convolutional neural network for few-shot image classification. Third, we build a generic vision Transformer with a co-scale structure for image recognition and instance-level prediction. Finally, we present a Transformer model with a multi-scale structure for line segment detection.