In recent years, the field of vision-and-language research has witnessed significant advances aimed at bridging the gap between visual perception and linguistic understanding. These studies have explored various approaches to enhancing the capabilities of AI systems in generating natural language or visual content, understanding multimodal scenarios, and conducting commonsense reasoning. Despite these advances, further progress is needed to enable more collaborative and comprehensive interaction between the vision and language modalities. This dissertation addresses this need through three primary contributions:
First, I introduce the concept of machine imagination for natural language processing. Specifically, I present the use of machine-generated visual information for the automatic evaluation of natural language generation, as well as for improving natural language understanding and natural language generation.
Second, I explore the use of large language models (LLMs) to enhance performance on vision and multimodal tasks. In particular, I examine the effectiveness of applying LLMs to prompt editing for text-to-image generation, compositional layout planning and generation, and vision-and-language navigation.
Third, I outline my contributions to open-source vision-and-language research. Specifically, we introduce Multimodal C4, a large-scale multimodal dataset of interleaved images and text, which we used to train the large-scale multimodal model OpenFlamingo. Additionally, we introduce VisIT-Bench, a public benchmark for evaluating instruction-following vision-language models in real-world applications.
This dissertation aims to push the boundaries of vision-and-language integration, providing new insights and tools for developing more sophisticated AI systems capable of seamless multimodal interactions.