eScholarship
Open Access Publications from the University of California

UCLA

UCLA Electronic Theses and Dissertations

Open Vocabulary Part Grounding in Multimodal Large Language Models

Abstract

We investigate open-vocabulary part segmentation and show why it is harder than bounding box detection, which demands less granularity. Part grounding requires a deeper understanding of object structure, because models must differentiate between visually similar parts. We evaluate models such as DesCo, LISA, and VLPart on the PACO dataset and examine the limitations of these approaches. LISA Description, which leverages descriptive input, performs significantly better in segmentation, achieving an Average AP of 16.3, which demonstrates the value of contextual information for part differentiation. Descriptive training proved ineffective for bounding box detection, however: DesCo PACO (+ve), trained without descriptions, outperformed the descriptive models with an AP of 23.37. This discrepancy underscores the differing requirements of bounding box detection and segmentation. Current models still struggle with the precision that part segmentation demands, emphasizing the need for further advances in open-vocabulary part grounding.
