🤖 AI Summary
Current vision-language-action (VLA) models lack the capability to infer human implicit intent, limiting their effectiveness in complex, real-world human-robot interaction. To address this, we propose an intent-aware curriculum training paradigm integrated with a lightweight embodied reasoning mechanism, enabling zero-shot interaction without explicit instructions for the first time. Our method builds upon vision-language models and jointly pretrains on three complementary data modalities: intent inference, spatial grounding, and embodied reasoning; multimodal reasoning outputs directly condition action generation. Experiments demonstrate substantial improvements: +18% success rate over π₀ under direct指令, +28% over ECoT under intent-based instructions, more than 2× performance on out-of-distribution tasks, and a 40% zero-shot interaction success rate—marking significant advances in VLA model generalizability and practicality for open-world scenarios.
📝 Abstract
Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control, offering a promising path toward general-purpose embodied intelligence. However, current SOTA VLAs are primarily pretrained on multimodal tasks with limited relevance to embodied scenarios, and then finetuned to map explicit instructions to actions. Consequently, due to the lack of reasoning-intensive pretraining and reasoning-guided manipulation, these models are unable to perform implicit human intention reasoning required for complex, real-world interactions. To overcome these limitations, we propose extbf{IntentionVLA}, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning, endowing the model with both reasoning and perception capabilities. In the following finetuning stage, IntentionVLA employs the compact reasoning outputs as contextual guidance for action generation, enabling fast inference under indirect instructions. Experimental results show that IntentionVLA substantially outperforms $π_0$, achieving 18% higher success rates with direct instructions and 28% higher than ECoT under intention instructions. On out-of-distribution intention tasks, IntentionVLA achieves over twice the success rate of all baselines, and further enables zero-shot human-robot interaction with 40% success rate. These results highlight IntentionVLA as a promising paradigm for next-generation human-robot interaction (HRI) systems.