IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses unstable execution and inter-chunk conflict in vision-language-action (VLA) policies under partial observability. These problems arise because frame-conditioned policies may resample a different short-horizon intent at each replanning step. The authors propose IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition action-chunk generation, improving consistency across chunks. They also introduce AliasBench, a 12-task benchmark built on RoboTwin2 that isolates short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.

📝 Abstract

Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.
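
The inter-chunk conflict described in the abstract can be illustrated with a toy 1-D example. This is a minimal sketch under stated assumptions, not the IntentVLA architecture: the names `frame_policy`, `history_policy`, `rollout`, and `CHUNK`, and the rule "intent = sign of recent motion", are all hypothetical and chosen only for illustration. Every frame is treated as aliased, meaning it is compatible with both a "move left" and a "move right" intent.

```python
import random

CHUNK = 4  # actions per chunk (hypothetical value)


def frame_policy(obs, rng):
    """Frame-conditioned toy policy: sees only the current observation.

    The observation is aliased (it does not reveal the intent), so the
    policy resamples an intent at every replanning step.
    """
    intent = rng.choice([-1, 1])
    return [intent] * CHUNK


def history_policy(history, rng):
    """History-conditioned toy policy.

    Summarises recent positions into a compact "intent": the sign of the
    net recent motion. Falls back to sampling when there is no motion yet.
    """
    if len(history) >= 2 and history[-1] != history[0]:
        intent = 1 if history[-1] > history[0] else -1
    else:
        intent = rng.choice([-1, 1])
    return [intent] * CHUNK


def rollout(use_history, steps=20, seed=0, window=3):
    """Run `steps` replanning steps; return (direction flips, final position)."""
    rng = random.Random(seed)
    pos, history, chunk_dirs = 0, [0], []
    for _ in range(steps):
        if use_history:
            chunk = history_policy(history[-window:], rng)
        else:
            chunk = frame_policy(pos, rng)
        chunk_dirs.append(chunk[0])
        for action in chunk:
            pos += action
            history.append(pos)
    # A flip means two adjacent chunks pursued opposite intents.
    flips = sum(1 for a, b in zip(chunk_dirs, chunk_dirs[1:]) if a != b)
    return flips, pos


if __name__ == "__main__":
    print("frame-conditioned   (flips, final pos):", rollout(use_history=False))
    print("history-conditioned (flips, final pos):", rollout(use_history=True))
```

In this toy, the history-conditioned policy commits to whichever intent it sampled first, so adjacent chunks never conflict, while the frame-conditioned policy may reverse direction between chunks and make little net progress. The actual intent representation in IntentVLA is encoded from recent visual observations; the sign-of-motion rule here only stands in for that idea.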

Problem

Research questions and friction points this paper is trying to address.

short-horizon intent

observation aliasing

robot imitation learning

VLA policy

multimodal demonstration

Innovation

Methods, ideas, or system contributions that make the work stand out.

IntentVLA

short-horizon intent

vision-language-action policy