Advancing AI Understanding in Language & Vision
- Sharma, Aditya
- Advisor(s): Wang, William; Höllerer, Tobias
Abstract
Large Language Models (LLMs) have emerged as powerful tools, demonstrating impressive capabilities in natural language generation. These pre-trained models consistently achieve state-of-the-art results across a wide range of multi-modal benchmarks. However, this raises a crucial question: do LLMs truly understand and reason about the information they process, or are they merely advanced pattern recognizers? This thesis investigates the reasoning and understanding capabilities of language models, aiming to develop more context-aware and intelligent AI systems. First, we introduce WikiWhy, a benchmark designed to evaluate the ability of LLMs to answer and explain cause-and-effect questions. Next, we present OCTO+, a state-of-the-art suite for automatic object placement in augmented reality, which leverages open-vocabulary Vision Language Models (VLMs) to integrate virtual content seamlessly. Finally, we propose the Visual Needle in a Haystack framework, which assesses the long-context reasoning of VLMs and highlights their difficulty with distractor images. By addressing limitations in long-context reasoning and promoting interpretability, this thesis seeks to unlock the full potential of LLMs and VLMs, enabling them to truly understand and reason about the world.