Understanding the real world remains a challenging task for vision-language foundation models, especially when compared to the ease and accuracy of human perception. This dissertation addresses two critical capabilities needed to bridge this gap: spatial understanding and long-context modeling. Spatial understanding lets models decompose scenes into their constituent objects, enabling fine-grained interpretation of images and videos and supporting applications such as user interaction and image editing. Long-context modeling, in turn, is essential for processing extremely long videos and for complex reasoning over extended sequences.
To tackle these challenges, we develop and evaluate a series of novel approaches. We introduce GroupViT, a vision transformer architecture in which semantic segmentation emerges from text supervision alone, without any mask annotations. We further leverage text-to-image diffusion models for open-vocabulary panoptic segmentation, using frozen diffusion UNet features to obtain more generalizable scene representations. Moving beyond segmentation, we propose pixel-aligned language models that ground each word of a text response to specific image regions, providing pixel-level localization alongside the text output.
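To illustrate the grouping idea behind GroupViT, the following is a minimal sketch in PyTorch, written under assumptions rather than from the dissertation's implementation: learnable group tokens attend over patch tokens and pool them into segment-level features, so segments can emerge without mask labels. All class, parameter, and variable names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupingBlockSketch(nn.Module):
    """Sketch of a GroupViT-style grouping step (illustrative simplification).

    Learnable group tokens attend over image patch tokens; a soft assignment
    pools the patches into a smaller set of group (segment) tokens, so that
    segment structure can emerge from image-text supervision alone.
    """
    def __init__(self, dim, num_groups):
        super().__init__()
        self.group_tokens = nn.Parameter(torch.randn(num_groups, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)

    def forward(self, image_tokens):                      # (batch, num_patches, dim)
        b = image_tokens.size(0)
        q = self.to_q(self.group_tokens).unsqueeze(0).expand(b, -1, -1)  # (b, G, dim)
        k = self.to_k(image_tokens)                                      # (b, N, dim)
        logits = q @ k.transpose(1, 2) / k.size(-1) ** 0.5               # (b, G, N)
        # Soft assignment of each patch to a group (the paper uses a
        # hard, straight-through gumbel-softmax assignment instead).
        assign = F.softmax(logits, dim=1)
        group_feats = assign @ image_tokens                              # (b, G, dim)
        return group_feats, assign
```

In the full model, such grouping stages are stacked so that image tokens are progressively merged into fewer, larger segments, and the final group features are matched against text embeddings with a contrastive objective.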
For long-context modeling, we introduce an efficient RNN-style Test-Time Training (TTT) layer whose hidden state is itself a small model updated at every step, making it more expressive than existing recurrent alternatives while remaining significantly more efficient than traditional self-attention. This design makes it practical to model videos up to a minute in length.
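To make the recurrence concrete, the sketch below shows an assumed, minimal TTT-style layer in PyTorch: the hidden state is the weight matrix of a per-sequence linear model that is updated by one gradient step on a self-supervised reconstruction loss at every token and then queried to produce the output. The projection names, learning rate, and loss are illustrative simplifications, not the dissertation's exact design.

```python
import torch
import torch.nn as nn

class TTTLinearSketch(nn.Module):
    """Sketch of an RNN-style TTT layer with a linear-model hidden state."""
    def __init__(self, dim, lr=0.1):
        super().__init__()
        self.theta_k = nn.Linear(dim, dim, bias=False)  # "training" view of the token
        self.theta_v = nn.Linear(dim, dim, bias=False)  # reconstruction target
        self.theta_q = nn.Linear(dim, dim, bias=False)  # query view at output time
        self.lr = lr

    def forward(self, x):                               # x: (batch, seq, dim)
        b, t, d = x.shape
        # Hidden state: weights W of the inner model f(z) = z @ W, per sequence.
        W = torch.zeros(b, d, d, device=x.device)
        outputs = []
        for i in range(t):
            k = self.theta_k(x[:, i])                   # (b, d)
            v = self.theta_v(x[:, i])                   # (b, d)
            q = self.theta_q(x[:, i])                   # (b, d)
            pred = torch.bmm(k.unsqueeze(1), W).squeeze(1)          # f_W(k)
            # One gradient step on 0.5 * ||f_W(k) - v||^2 with respect to W.
            grad_W = torch.bmm(k.unsqueeze(2), (pred - v).unsqueeze(1))
            W = W - self.lr * grad_W
            outputs.append(torch.bmm(q.unsqueeze(1), W).squeeze(1))  # f_W(q)
        return torch.stack(outputs, dim=1)
```

Because the state has a fixed size regardless of sequence length, the cost per token stays constant, which is what makes minute-long video sequences tractable compared with quadratic self-attention.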
Through these contributions, this dissertation advances spatial understanding and long-context modeling in vision-language foundation models, paving the way for more versatile and general-purpose intelligent systems.