🤖 AI Summary
Vision-Language-Action (VLA) models exhibit significantly degraded object referring and localization performance in complex or out-of-distribution scenarios when relying solely on textual prompts. Method: We propose Point-VLA, a framework that explicitly incorporates visual prompts, such as bounding boxes, into the VLA pipeline to enable joint text-visual referring expression grounding, thereby improving pixel-level object localization and embodied control accuracy. Contributions/Results: (1) a plug-and-play visual prompt fusion architecture enabling modular integration of diverse visual cues; (2) a low-overhead automated visual annotation pipeline for scalable data generation and precise grounding supervision; (3) an end-to-end joint fine-tuning strategy. Experiments demonstrate that Point-VLA substantially outperforms text-only VLA baselines across multiple real-world referring tasks, with marked gains in robustness and generalization, particularly in cluttered environments and with previously unseen objects.
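The summary does not specify how the plug-and-play fusion is implemented, but the general idea, injecting bounding-box prompts alongside the text instruction, can be illustrated with a minimal PyTorch sketch. Everything below (the `VisualPromptEncoder` module, token counts, and the prepend-and-concatenate fusion) is an assumption for illustration, not the paper's actual architecture:

```python
# Hypothetical sketch: turning a normalized bounding box into prompt tokens
# and fusing them with embedded text tokens before the VLA policy backbone.
# All names and shapes here are illustrative assumptions.
import torch
import torch.nn as nn

class VisualPromptEncoder(nn.Module):
    """Maps a normalized box (x1, y1, x2, y2) to a few prompt tokens."""
    def __init__(self, d_model: int = 512, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj = nn.Sequential(
            nn.Linear(4, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model * n_tokens),
        )

    def forward(self, box: torch.Tensor) -> torch.Tensor:
        # box: (B, 4), coordinates normalized to [0, 1]
        tokens = self.proj(box)                             # (B, d_model * n_tokens)
        return tokens.view(box.size(0), self.n_tokens, -1)  # (B, n_tokens, d_model)

def fuse_prompts(text_tokens: torch.Tensor, box_tokens: torch.Tensor) -> torch.Tensor:
    # Plug-and-play fusion: prepend visual-prompt tokens to the instruction
    # sequence so the downstream policy attends over both jointly.
    return torch.cat([box_tokens, text_tokens], dim=1)

# Usage
encoder = VisualPromptEncoder(d_model=512)
text_tokens = torch.randn(2, 16, 512)            # e.g. an embedded instruction
box = torch.tensor([[0.2, 0.3, 0.5, 0.7],
                    [0.1, 0.1, 0.4, 0.6]])
fused = fuse_prompts(text_tokens, encoder(box))  # (2, 20, 512)
```

Prepending prompt tokens is only one plausible fusion choice; cross-attention or per-layer adapters would serve the same "modular integration of diverse visual cues" role.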
📝 Abstract
Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompts, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To scale visually grounded datasets efficiently, we further develop an automatic data annotation pipeline that requires minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only-instruction VLAs, with robust generalization, particularly in cluttered or unseen-object scenarios. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, enabling more generalizable embodied control.
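The abstract leaves the annotation pipeline's internals open; one common pattern for "minimal human effort" pipelines is to let an off-the-shelf open-vocabulary detector propose boxes and route only low-confidence cases to human review. The sketch below is a guess at that structure, with `detect_object` as a stand-in for whatever detector the authors actually use:

```python
# Hypothetical sketch of a low-overhead auto-annotation loop. A detector
# proposes a bounding box for the referred object in each frame; confident
# detections become training labels, the rest are queued for human review.
from dataclasses import dataclass

@dataclass
class Annotation:
    frame_id: str
    instruction: str
    box: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)
    confidence: float

def detect_object(frame, phrase: str) -> tuple[tuple, float]:
    """Placeholder: run an open-vocabulary detector on `frame` for the
    referring phrase and return (box, confidence)."""
    raise NotImplementedError

def auto_annotate(episodes, review_threshold: float = 0.5):
    accepted, needs_review = [], []
    for ep in episodes:
        box, conf = detect_object(ep["frame"], ep["instruction"])
        ann = Annotation(ep["frame_id"], ep["instruction"], box, conf)
        # High-confidence detections go straight into the dataset;
        # the remainder fall back to (minimal) human verification.
        (accepted if conf >= review_threshold else needs_review).append(ann)
    return accepted, needs_review
```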