Evaluating and improving the reasoning capabilities of large vision-language models (LVLMs) is essential for advancing their ability to handle complex, real-world tasks. Among these tasks, reasoning jointly over text and visual elements in context-rich environments, such as navigating public spaces or interpreting infographics, is particularly critical. Such tasks demand context-sensitive, text-rich visual reasoning, where the interplay between textual and visual components within an image is key to understanding. However, existing datasets fall short of benchmarking state-of-the-art multimodal models' capabilities in this domain. To address this gap, we introduce ConTextual, a novel dataset of human-crafted instructions tailored to text-rich images. Our study evaluates 14 foundation models, including GPT-4V, Gemini-Pro-Vision, and LLaVA-Next, and establishes a human performance baseline. The results reveal a significant 30.8% performance gap between GPT-4V, the current best-performing large multimodal model, and human-level reasoning. A fine-grained analysis shows that while GPT-4V demonstrates competence in abstract visual contexts such as memes and quotes, it struggles with time-related data and infographics. Additionally, qualitative analysis uncovers shortcomings, including imprecise visual perception and hallucinations.

In parallel, we focus on improving the reasoning capabilities of LVLMs through preference fine-tuning. While techniques such as Direct Preference Optimization (DPO) have shown promise when trained on AI-generated feedback, they often fail to address the noise inherent in synthetic annotations, such as stylistic and length biases. To overcome this limitation, we propose a hard-negative response generation framework that produces rejected responses containing targeted errors while preserving stylistic and length consistency with the accepted responses. Using this methodology, we develop the VaPR dataset, comprising 10,000 high-quality preference samples for refining reasoning in LVLMs.
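For reference, preference fine-tuning of this kind optimizes the standard DPO objective, which contrasts the policy $\pi_\theta$ with a frozen reference model $\pi_{\mathrm{ref}}$ over accepted/rejected response pairs $(y_w, y_l)$; the formulation below is the original one from Rafailov et al. (2023), not a VaPR-specific variant:
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]
Because the hard-negative framework constructs $y_l$ to match $y_w$ in style and length, the contrast inside the sigmoid is driven by the injected error rather than by superficial stylistic or length cues.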
We fine-tune two LVLM families, LLaVA-V1.5 and Qwen2VL, on the VaPR dataset, yielding significant improvements across nine benchmarks. Our smallest model, Qwen2VL-VaPR-2B, achieves an average gain of 4.6%, while our largest, LLaVA-VaPR-13B, achieves an average improvement of 6.7%. The VaPR-tuned models excel across spatial, textual, general visual, and adversarial reasoning tasks, and they mitigate persistent issues in LVLMs, such as the tendency to over-answer "Yes" on binary questions. This work contributes a comprehensive approach to advancing LVLM reasoning by addressing both performance gaps and methodological challenges.
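To make the training data format concrete, the following is a minimal, hypothetical Python sketch of how an accepted response and a hard-negative rejected response could be paired under a simple length-consistency filter; the field names, the word-count heuristic, and the 20% tolerance are illustrative assumptions, not the released VaPR pipeline:

from dataclasses import dataclass

@dataclass
class PreferencePair:
    image_id: str
    prompt: str
    chosen: str    # accepted (correct) response
    rejected: str  # hard negative containing a targeted, localized error

def length_consistent(chosen: str, rejected: str, tolerance: float = 0.2) -> bool:
    """Keep only pairs whose word counts differ by at most `tolerance` (relative)."""
    n_chosen, n_rejected = len(chosen.split()), len(rejected.split())
    return abs(n_chosen - n_rejected) / max(n_chosen, 1) <= tolerance

def build_pair(image_id: str, prompt: str, chosen: str, candidate_rejected: str):
    """Pair a candidate hard negative with the accepted response, discarding it
    if the two differ too much in length; stylistic consistency is assumed to be
    handled upstream by the model that edits the error into the response."""
    if not length_consistent(chosen, candidate_rejected):
        return None  # drop pairs where length alone could give the preference away
    return PreferencePair(image_id, prompt, chosen, candidate_rejected)

Filtering of this kind is one simple way to keep the rejected response from being distinguishable by surface features, so that the preference signal reflects the targeted error itself.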
Finally, we release all datasets, models, and code to foster further research and collaboration. The introduction of the ConTextual and VaPR datasets, coupled with rigorous benchmarking and fine-tuning strategies, provides a significant step toward enhancing the reasoning capabilities of large vision-language models in real-world applications.