🤖 AI Summary
Vision-Language-Action (VLA) models exhibit significantly degraded object referring and localization performance in complex or out-of-distribution scenarios when relying solely on textual prompts. Method: We propose Point-VLA, a framework that explicitly incorporates visual prompts, such as bounding boxes, into the VLA pipeline to enable joint text-visual referring expression grounding, thereby improving pixel-level object localization and embodied control accuracy. Contributions/Results: (1) a plug-and-play visual prompt fusion architecture enabling modular integration of diverse visual cues; (2) a low-overhead automated visual annotation pipeline for scalable data generation and precise grounding supervision; (3) an end-to-end joint fine-tuning strategy. Experiments demonstrate that Point-VLA substantially outperforms text-only VLA baselines across multiple real-world referring tasks, with marked gains in robustness and generalization, particularly in cluttered environments and with previously unseen objects.
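The summary does not specify how the plug-and-play fusion is implemented, but the general idea, injecting bounding-box prompts alongside the text instruction, can be illustrated with a minimal PyTorch sketch. Everything below (the `VisualPromptEncoder` module, token counts, and the prepend-and-concatenate fusion) is an assumption for illustration, not the paper's actual architecture:

```python
# Hypothetical sketch: turning a normalized bounding box into prompt tokens
# and fusing them with embedded text tokens before the VLA policy backbone.
# All names and shapes here are illustrative assumptions.
import torch
import torch.nn as nn

class VisualPromptEncoder(nn.Module):
    """Maps a normalized box (x1, y1, x2, y2) to a few prompt tokens."""
    def __init__(self, d_model: int = 512, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj = nn.Sequential(
            nn.Linear(4, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model * n_tokens),
        )

    def forward(self, box: torch.Tensor) -> torch.Tensor:
        # box: (B, 4), coordinates normalized to [0, 1]
        tokens = self.proj(box)                             # (B, d_model * n_tokens)
        return tokens.view(box.size(0), self.n_tokens, -1)  # (B, n_tokens, d_model)

def fuse_prompts(text_tokens: torch.Tensor, box_tokens: torch.Tensor) -> torch.Tensor:
    # Plug-and-play fusion: prepend visual-prompt tokens to the instruction
    # sequence so the downstream policy attends over both jointly.
    return torch.cat([box_tokens, text_tokens], dim=1)

# Usage
encoder = VisualPromptEncoder(d_model=512)
text_tokens = torch.randn(2, 16, 512)            # e.g. an embedded instruction
box = torch.tensor([[0.2, 0.3, 0.5, 0.7],
                    [0.1, 0.1, 0.4, 0.6]])
fused = fuse_prompts(text_tokens, encoder(box))  # (2, 20, 512)
```

Prepending prompt tokens is only one plausible fusion choice; cross-attention or per-layer adapters would serve the same "modular integration of diverse visual cues" role.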
📝 Abstract
Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompts, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To scale visually grounded datasets efficiently, we further develop an automatic data annotation pipeline that requires minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only-instruction VLAs, with robust generalization, particularly in cluttered or unseen-object scenarios. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, enabling more generalizable embodied control.
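The abstract leaves the annotation pipeline's internals open; one common pattern for "minimal human effort" pipelines is to let an off-the-shelf open-vocabulary detector propose boxes and route only low-confidence cases to human review. The sketch below is a guess at that structure, with `detect_object` as a stand-in for whatever detector the authors actually use:

```python
# Hypothetical sketch of a low-overhead auto-annotation loop. A detector
# proposes a bounding box for the referred object in each frame; confident
# detections become training labels, the rest are queued for human review.
from dataclasses import dataclass

@dataclass
class Annotation:
    frame_id: str
    instruction: str
    box: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)
    confidence: float

def detect_object(frame, phrase: str) -> tuple[tuple, float]:
    """Placeholder: run an open-vocabulary detector on `frame` for the
    referring phrase and return (box, confidence)."""
    raise NotImplementedError

def auto_annotate(episodes, review_threshold: float = 0.5):
    accepted, needs_review = [], []
    for ep in episodes:
        box, conf = detect_object(ep["frame"], ep["instruction"])
        ann = Annotation(ep["frame_id"], ep["instruction"], box, conf)
        # High-confidence detections go straight into the dataset;
        # the remainder fall back to (minimal) human verification.
        (accepted if conf >= review_threshold else needs_review).append(ann)
    return accepted, needs_review
```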