Learning based computer vision algorithms often fail to generalize when encountering new scenarios. This can lead to wildly erroneous predictions that can be fairly obvious to a human. Contrary to humans, these algorithms often lack context or prior knowledge that can be leveraged to constrain their predictions. In this dissertation, we propose to use implicit and multi-modal data-driven priors to improve the reliability of learning based computer-vision algorithms.
In the first work we present Multi-mOdal PosE Diffuser (MOPED), a multi-modal generative pose prior for 3D human pose tasks. MOPED employs a novel multi-modal conditional diffusion model as a prior for 3D human pose estimation. The versatile conditioning mechanism, allowing the generation of realistic human poses from diverse multi-modal inputs such as textual descriptions and images. This flexibility enhances the applicability of our approach, enabling the incorporation of additional context often overlooked in pose priors.
In the second work we explore how to apply (MOPED) to a wide variety of tasks. Specifically, we explore how to generate poses while constraining the output based on a specific task or observation. This allows us to adapt \method{} to new pose related tasks without needing to retrain. We demonstrate \method{}'s effectiveness on the following tasks: Monocular 3D Pose Estimation, Multi-View 3D Pose Estimation, Inverse Kinematic Solvers for Pose Completion, Pose Generation, and finally CLIP loss guided sampling.
In the last work we present Poisson2Sparse, a novel self-supervised learning technique for denoising single images, specifically addressing noise modeled as a Poisson process. Specifically, we approximate traditional iterative optimization algorithms for image denoising with a recurrent neural network which enforces sparsity with respect to the weights of the network. Since the sparse representations are based on the underlying image, it is able to suppress the spurious components (noise) in the image patches, thereby introducing implicit regularization for denoising tasks through the network structure.