IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses instability and conflicts between consecutive action chunks in vision-language-action (VLA) policies under partial observability, which arise when the policy infers a different short-horizon intent at adjacent replanning steps. The authors propose IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition action-chunk generation, improving consistency across chunks. The study presents short-horizon intent modeling as a way to reduce observation aliasing, a common problem in partially observable environments, and introduces AliasBench, a 12-task benchmark built on RoboTwin2 to isolate it. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, the method improves rollout stability and outperforms strong VLA baselines.
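The inter-chunk conflict described above can be illustrated with a toy 1-D rollout. This is a minimal sketch under stated assumptions, not the IntentVLA architecture: the two policy functions, `CHUNK_LEN`, and the "sign of recent displacement" intent encoder are all hypothetical stand-ins chosen for illustration.

```python
import random

CHUNK_LEN = 4  # actions per chunk (hypothetical value)


def frame_conditioned_chunk(obs, rng):
    # The current observation alone does not reveal the intent in this toy,
    # so the policy effectively samples one of the two demonstrated modes
    # at every replanning step (obs is deliberately unused).
    direction = rng.choice([+1, -1])
    return [direction] * CHUNK_LEN


def history_conditioned_chunk(history, rng):
    # Stand-in "intent encoder": the sign of the most recent displacement.
    if len(history) >= 2 and history[-1] != history[-2]:
        direction = 1 if history[-1] > history[-2] else -1
    else:
        direction = rng.choice([+1, -1])  # no usable history yet: pick a mode once
    return [direction] * CHUNK_LEN


def rollout(use_history, replans=20, seed=0):
    """Run a rollout and count direction flips between adjacent chunks."""
    rng = random.Random(seed)
    position, history, chunk_dirs = 0, [0], []
    for _ in range(replans):
        if use_history:
            chunk = history_conditioned_chunk(history, rng)
        else:
            chunk = frame_conditioned_chunk(position, rng)
        chunk_dirs.append(chunk[0])
        for action in chunk:
            position += action
            history.append(position)
    return sum(a != b for a, b in zip(chunk_dirs, chunk_dirs[1:]))


if __name__ == "__main__":
    print("frame-conditioned flips:", rollout(use_history=False))
    print("history-conditioned flips:", rollout(use_history=True))
```

In this toy, the history-conditioned variant commits to one direction after the first chunk and never flips, while the frame-conditioned variant re-samples a mode at every replanning step and so tends to reverse direction between chunks.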
📝 Abstract
Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.
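To make observation aliasing concrete, here is a small tabular sketch. It assumes two hypothetical demonstrations that pass through the same observation (`"mid"`) with opposite intents, and compares a context of one observation (frame-conditioned) with a context of two (history-conditioned). The data, the `fit` helper, and the `history_len` parameter are illustrative assumptions, not part of the paper.

```python
from collections import Counter, defaultdict

# Toy demonstrations: each step is (observation, action).
# "mid" is aliased: it occurs while moving right AND while moving left.
demos = [
    [("left", "+1"), ("mid", "+1"), ("right", "stop")],   # intent: go right
    [("right", "-1"), ("mid", "-1"), ("left", "stop")],   # intent: go left
]


def fit(demos, history_len):
    """Count demonstrated actions per context.

    A context is the tuple of the last `history_len` observations
    (fewer at the start of a demonstration).
    """
    table = defaultdict(Counter)
    for demo in demos:
        observations = [obs for obs, _ in demo]
        for t, (_, action) in enumerate(demo):
            start = max(0, t - history_len + 1)
            context = tuple(observations[start:t + 1])
            table[context][action] += 1
    return table


def is_ambiguous(table, context):
    """True if more than one distinct action was demonstrated for this context."""
    return len(table.get(context, {})) > 1


frame_table = fit(demos, history_len=1)    # current observation only
history_table = fit(demos, history_len=2)  # current + previous observation

if __name__ == "__main__":
    print(is_ambiguous(frame_table, ("mid",)))
    print(is_ambiguous(history_table, ("left", "mid")))
    print(is_ambiguous(history_table, ("right", "mid")))
```

With a single-frame context, `"mid"` maps to both `"+1"` and `"-1"`, so the target is multimodal; adding one step of history separates the two cases in this toy example.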
Problem

Research questions and friction points this paper is trying to address.

short-horizon intent
observation aliasing
robot imitation learning
VLA policy
multimodal demonstration
Innovation

Methods, ideas, or system contributions that make the work stand out.

IntentVLA
short-horizon intent
vision-language-action policy
observation aliasing
history-conditioned modeling