Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

📅 2025-05-20
🤖 AI Summary
This work addresses five real-world challenges in pixel-level grounding of complex textual instructions by vision-language models (VLMs): hallucinated references, multi-object ambiguity, reasoning-dependent grounding, multi-granularity expressions, and part-level referring. The authors propose an automated data-expansion workflow that distills knowledge from a pretrained teacher VLM to generate high-quality instruction-response pairs aligned with existing pixel-level annotations, sharply reducing annotation cost. Models trained on the resulting dataset, Ground-V, improve substantially across diverse grounding tasks: average gIoU rises by 4.4% for LISA and 7.9% for PSALM across six benchmarks, with new state-of-the-art results on RefCOCO, RefCOCO+, and RefCOCOg. On gRefCOCO, N-Acc reaches 83.3%, exceeding the previous state of the art by more than 20%. To the authors' knowledge, this is the first systematic solution to fine-grained, instruction-driven visual grounding under complex linguistic constraints.

📝 Abstract
This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. In particular, we address five critical real-world challenges in text-instruction-based grounding: hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references. By leveraging knowledge distillation from a pre-trained teacher model, our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations, minimizing the need for costly human annotation. The resulting dataset, Ground-V, captures rich object localization knowledge and nuanced pixel-level referring expressions. Experiment results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training directly achieves an average accuracy boost of 4.4% for LISA and a 7.9% for PSALM across six benchmarks on the gIoU metric. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state-of-the-art by more than 20%.
Problem

Research questions and friction points this paper is trying to address.

Teaching VLMs to ground complex instructions in pixels
Addressing the five key challenges of text-instruction-based grounding
Generating high-quality instruction data linked to existing pixel-level annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages knowledge distillation from a pre-trained teacher VLM
Generates high-quality instruction-response pairs automatically
Links generated pairs to existing pixel-level annotations, avoiding costly human labeling
Yongshuo Zong
University of Edinburgh
Qin Zhang
AWS AI Labs
Dongsheng An
AWS AI Labs
Zhihua Li
Amazon
Xiang Xu
AWS AI Labs
Linghan Xu
AWS AI Labs
Zhuowen Tu
Professor, Cognitive Science, Computer Science & Engineering, UC San Diego
Yifan Xing
AWS AI Labs
O. Dabeer
AWS AI Labs