Mathematical reasoning is a core component of human intelligence and is crucial for advancing education and science. This dissertation develops language model systems capable of robust mathematical reasoning, a step toward general artificial intelligence. We introduce multi-modal and knowledge-intensive benchmarks that assess the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs) in real-world contexts involving visual information, tabular data, and scientific domains.
The dissertation also proposes new pre-trained VLMs. For instance, Patch-Trm is a patch-based cross-modal Transformer model for abstract diagram reasoning. We further present retrieval and tool-augmented algorithms that enhance LLM capabilities. Inter-GPS is a neuro-symbolic geometry solver that demonstrates human-level performance, a first in the domain. PromptPG pioneers the use of reinforcement learning for dynamic in-context example selection, significantly improving the stability and accuracy of LLMs. Chameleon integrates LLMs with external tools, increasing their flexibility and effectiveness in real-world applications. The dissertation concludes by analyzing the latest advances in mathematical reasoning within visual contexts and highlighting current challenges and future prospects.