PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

📅 2025-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large multimodal models (LMMs) struggle with pixel-level part understanding: they have difficulty identifying object-specific parts, which limits fine-grained compositional reasoning. To address this, we introduce PARTONOMY, a benchmark for pixel-level part grounding comprising 534 object labels and 862 part labels. It challenges models to compare objects' parts, reason about part-whole relationships, and justify textual predictions with visual segmentations. Methodologically, we propose PLUM, a segmenting LMM that uses span tagging in place of special [SEG] tokens (which are not seen during pretraining) and feeds prior segmentation predictions back into the model to guide subsequent ones. Pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, visual question answering (VQA), and visual hallucination benchmarks. When finetuned on the proposed Explanatory Part Segmentation task, PLUM is competitive with segmenting LMMs trained on significantly more segmentation data.

📝 Abstract

Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects' parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining, which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.
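The gIoU figure quoted in the abstract can be made concrete with a small sketch. It assumes gIoU follows the convention common in the reasoning-segmentation literature, namely the mean of per-image IoU scores, with cIoU (cumulative intersection over cumulative union) shown for contrast. The function names and the handling of empty masks are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def iou(pred, target):
    """IoU of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

def giou(preds, targets):
    """Mean of per-image IoUs (assumed definition of gIoU)."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))

def ciou(preds, targets):
    """Cumulative intersection over cumulative union across all images."""
    inter = 0
    union = 0
    for p, t in zip(preds, targets):
        p = np.asarray(p, dtype=bool)
        t = np.asarray(t, dtype=bool)
        inter += int(np.logical_and(p, t).sum())
        union += int(np.logical_or(p, t).sum())
    return inter / union if union else 1.0

# Two toy 2x2 masks: the first prediction over-segments, the second is exact.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
print(giou(preds, targets))  # per-image IoUs are 0.5 and 1.0
print(ciou(preds, targets))  # 2 intersecting pixels over 3 union pixels
```

Because gIoU averages per image, small parts weigh as much as large ones, whereas cIoU is dominated by images with large masks. This matters for part segmentation, where many targets cover only a few pixels.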

Problem

Research questions and friction points this paper is trying to address.

Enhancing part-level visual understanding in multimodal models

Addressing limitations in object part grounding and segmentation

Improving fine-grained compositional reasoning with novel architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses span tagging instead of segmentation tokens

Conditions on prior predictions in feedback loop

Trains part-centric LMMs for fine-grained understanding
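The span-tagging idea listed above can be illustrated with a generic sketch: rather than emitting a special token, each token of the generated text receives a tag, and contiguous tagged tokens form the spans to be segmented. The BIO scheme, the `extract_spans` helper, and the example sentence below are assumptions for illustration only; this page does not describe PLUM's actual tagging scheme.

```python
def extract_spans(tokens, tags):
    """Group BIO-tagged tokens into (text, start, end) spans; end is exclusive.

    Hypothetical helper: "B" opens a span, "I" continues it, "O" closes it.
    A stray "I" with no open span is treated as opening a new span.
    """
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))  # close the previous span
            start = i
        elif tag == "I":
            if start is None:
                start = i
        else:  # "O"
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))  # span running to the end
    return [(" ".join(tokens[s:e]), s, e) for s, e in spans]

# Illustrative sentence; the part names are made up for this example.
tokens = ["The", "airplane", "has", "a", "spray", "boom", "and", "a", "propeller", "."]
tags = ["O", "O", "O", "O", "B", "I", "O", "O", "B", "O"]
print(extract_spans(tokens, tags))
# [('spray boom', 4, 6), ('propeller', 8, 9)]
```

In a segmenting LMM, each extracted span would then be paired with a mask prediction. The appeal of tagging ordinary tokens is that the text vocabulary stays the same as in pretraining, which is the motivation the abstract gives for avoiding [SEG] tokens.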

🔎 Similar Papers

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

2024-10-10, arXiv.org, Citations: 5

Chrono: A Simple Blueprint for Representing Time in MLLMs

2024-06-26, Citations: 4
