Before deploying a machine learning model in a real application, it is important to ensure its reliability. Reliability can take many forms, but it is broadly defined as operating without failure. For instance, an incorrect prediction from a model could have many negative downstream effects, especially if a user has placed trust in the model or if the error is consumed and propagated by automated agents. Multimodal models are growing in their capabilities and applications, yet research into the unique reliability challenges they pose has been limited.
In this thesis, I cover my work towards improving reliability in the context of multimodal (vision + language) models. I approach this along three axes: addressing visual biases via model explainability; learning better confidence estimates, both to abstain from answering questions with high uncertainty and to reduce hallucinations in generated text; and investigating the contribution of language priors to caption error. In these works, I also present new evaluation frameworks that define particular areas of reliability. As machine learning models take a larger role in our society, carefully measuring and improving reliability becomes more important than ever.