Deep learning models have made remarkable progress in understanding object-centric visual data by focusing on one or more objects present in the scene. However, they may not perform well for scene-centric visual data that includes multiple objects, backgrounds, and other elements in the scene. Furthermore, the low interpretability of deep learning models due to their high complexity may hinder their trustworthiness and practical deployment in the real world.
In this dissertation, we present a structure-aware representation learning method to address these challenges. Rather than relying solely on end-to-end supervised learning, we first use deep learning to learn multi-level representations that explicitly model the structures of individual objects, such as their shape and geometry, and then map these representations to the final prediction with a shallow classifier such as a single-layer perceptron. Because it explicitly extracts structure-aware representations from the input, this approach performs more robustly on non-object-centric visual data. The method also disentangles representation learning from classification, so analyzing the shallow classifier can provide a quantitative interpretation of why a prediction is made. We demonstrate the effectiveness of our approach on several standard computer vision benchmark datasets, as well as on real-world medical applications such as dry eye disease diagnosis and 3D dental casting.
Part I of the dissertation describes a segmentation-based method for learning instance-level representations that capture the individual characteristics of objects. We demonstrate its effectiveness by applying it to multi-level gland morphology quantification from medical images for disease diagnosis.
Part II presents approaches for learning geometric shape representations from visual data and shows how such representations can be used to reconstruct 3D shapes. We provide a medical application to 3D dental casting and jaw reconstruction.
In Part III, building on the instance-level and shape-aware representations from the previous parts, we map the representations to the final prediction with a shallow model and show how that model can be analyzed and interpreted. We demonstrate an application to demographics prediction from medical images, where we can identify the most relevant features that inform the model's decision and improve its reliability.
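The idea of reading an interpretation off a shallow classifier can be illustrated with a minimal sketch. It is not the dissertation's implementation: the feature names, the synthetic data, and the helper functions `train_single_layer_perceptron` and `rank_features` are all hypothetical, and ranking by absolute weight assumes the features are on comparable scales.

```python
import numpy as np

# Hypothetical structure-aware features; the names are illustrative only.
FEATURE_NAMES = ["gland_area", "gland_density", "shape_curvature", "unrelated_noise"]


def train_single_layer_perceptron(X, y, lr=0.1, epochs=500):
    """Fit a logistic single-layer classifier with plain gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the logits
        grad = probs - y                            # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / n_samples
        b -= lr * grad.mean()
    return w, b


def rank_features(w, names):
    """Order features by absolute weight, largest first."""
    order = np.argsort(-np.abs(w))
    return [(names[i], float(w[i])) for i in order]


# Synthetic stand-in for learned representations (assumed already standardized).
rng = np.random.default_rng(0)
X = rng.standard_normal((400, len(FEATURE_NAMES)))
# By construction, labels depend strongly on feature 0 and weakly on feature 2.
y = (2.0 * X[:, 0] + 0.5 * X[:, 2] > 0).astype(float)

w, b = train_single_layer_perceptron(X, y)
ranking = rank_features(w, FEATURE_NAMES)
accuracy = float(np.mean(((X @ w + b) > 0) == (y == 1.0)))

for name, weight in ranking:
    print(f"{name:>16s}  {weight:+.3f}")
```

Because the classifier is a single linear layer, each weight is directly tied to one input representation, which is what makes this kind of ranking possible without post hoc attribution methods.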
To make the proposed method suitable for practical deployment, Part IV introduces constrained neural optimization as a way to improve the efficiency of deep learning models. We present special cases of this approach, including orthogonal convolutional neural networks and recurrent parameter generators.
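As a rough illustration of what an orthogonality constraint on a convolutional layer can look like, the sketch below computes a soft penalty on a flattened kernel. This is a simplified, kernel-level assumption; the dissertation's orthogonal convolutional networks may instead constrain the convolution operator itself, and the function name `kernel_orthogonality_penalty` is hypothetical.

```python
import numpy as np


def kernel_orthogonality_penalty(kernel):
    """Return ||W W^T - I||_F^2 for a kernel flattened to one row per filter.

    kernel has shape (out_channels, in_channels, kernel_h, kernel_w).
    The penalty is zero exactly when the flattened filters are orthonormal.
    """
    out_channels = kernel.shape[0]
    W = kernel.reshape(out_channels, -1)   # one row per output filter
    gram = W @ W.T                         # pairwise filter inner products
    return float(np.sum((gram - np.eye(out_channels)) ** 2))


rng = np.random.default_rng(0)

# A random kernel generally has correlated, non-unit-norm filters.
random_kernel = rng.standard_normal((4, 2, 3, 3))
random_penalty = kernel_orthogonality_penalty(random_kernel)

# Build a kernel with orthonormal filters from a QR decomposition.
q, _ = np.linalg.qr(rng.standard_normal((2 * 3 * 3, 4)))  # q has orthonormal columns
orthogonal_kernel = q.T.reshape(4, 2, 3, 3)
orthogonal_penalty = kernel_orthogonality_penalty(orthogonal_kernel)

print(random_penalty, orthogonal_penalty)
```

In training, a term of this kind would typically be added to the task loss with a small weight, so that filters are encouraged toward orthogonality rather than forced to satisfy it exactly.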