🤖 AI Summary
This work addresses open-vocabulary object and part instance segmentation: jointly detecting and segmenting hierarchical objects and their parts from an open vocabulary. Methodologically, it introduces multimodal large language models (MLLMs) to part segmentation for the first time, proposing a language-guided hierarchical semantic modeling framework that achieves cross-granularity concept association and zero-shot generalization via vision-language alignment, hierarchical query generation, and autoregressive decoding. Key innovations include language-space-driven semantic structure construction and an MLLM-based query optimization strategy. Experiments show gains over prior state-of-the-art methods of +5.5% and +4.8% AP on PartImageNet for in-domain and cross-dataset evaluation, respectively, and +2.5% mIoU on zero-shot part segmentation on ADE20K.
📝 Abstract
We propose LangHOPS, the first Multimodal Large Language Model (MLLM)-based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS jointly detects and segments hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities and to link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% AP (cross-dataset) on the PartImageNet dataset and by 2.5% mIoU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and the MLLM-driven part query refinement strategy. The code will be released here.