🤖 AI Summary
This work addresses state aliasing in vision-language-action (VLA) models: visual representations inherited from pretrained VLMs are often insensitive to subtle differences between visually similar states, so states that call for substantially different actions are confused and the predicted actions are wrong. To mitigate this, the study adds inverse dynamics learning to the VLA framework as an auxiliary objective that directly supervises the vision encoder, training it to predict the action between the current and a future observation and thereby sharpening its sensitivity to fine-grained visual changes. A pseudo-reversed supervision mechanism further broadens the range of action directions the encoder sees. The method uses only standard observation-action pairs, requires no additional annotations, and leaves the inference pipeline unchanged. Experiments on the CALVIN ABC-D and SimplerEnv benchmarks show consistent gains across multiple VLA baselines, and frozen-encoder probing and state-feature alignment analyses indicate that the learned representations are more state-discriminative, reduce state aliasing, and align better with robot state changes.
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a promising framework that unifies perception, reasoning, and control for robot manipulation by adapting pretrained vision-language models (VLMs) to action prediction. However, VLM-derived representations are often insensitive to subtle visual distinctions required for low-level control, causing state aliasing between visually similar states that require substantially different actions. Prior VLA studies improve visual understanding by generating visual or reasoning outputs, such as future frames, 2D grounding points or traces, or intermediate spatial reasoning steps, but these objectives typically shape the vision encoder only indirectly through end-to-end prediction and do not explicitly analyze state aliasing in the learned visual feature space. To mitigate state aliasing, we introduce inverse dynamics learning as an auxiliary objective that directly supervises the VLA vision encoder. By predicting the action between current and future observations, our objective encourages the encoder to capture fine-grained visual distinctions that determine low-level actions. We further use pseudo-reversed supervision to expose the encoder to a broader range of action directions and improve generalization under limited robot demonstrations. Our method applies to diverse VLA baselines, uses only standard observation-action pairs without additional annotations, and preserves the original inference pipeline at test time. Experiments on CALVIN ABC-D and SimplerEnv show consistent gains across diverse VLA baselines. Frozen-encoder probing and state-feature alignment analyses further show that our method learns state-discriminative visual representations that reduce state aliasing and better align with robot state changes.
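As a rough illustration of how inverse dynamics targets with pseudo-reversed supervision might be assembled from ordinary demonstrations, the sketch below builds (start frame, end frame, action) training tuples from a list of per-step actions. The abstract does not say how the reversed labels are constructed, so everything here is an assumption: the names `IDMSample` and `build_idm_samples` are hypothetical, actions are treated as additive delta commands (for example relative end-effector displacements), and the reversed label is approximated by negating the net forward action. Non-additive dimensions such as a gripper open/close command would need separate handling.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class IDMSample:
    """One inverse dynamics training tuple (hypothetical structure)."""
    obs_start: int              # index of the first frame shown to the encoder
    obs_end: int                # index of the second frame shown to the encoder
    action: Tuple[float, ...]   # regression target for the inverse dynamics head


def build_idm_samples(
    actions: Sequence[Sequence[float]],
    horizon: int = 1,
    add_reversed: bool = True,
) -> List[IDMSample]:
    """Build forward (and optionally pseudo-reversed) inverse dynamics samples.

    actions[t] is assumed to be the delta action executed between frame t
    and frame t + 1, so a trajectory with N actions has N + 1 frames.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")

    samples: List[IDMSample] = []
    num_frames = len(actions) + 1
    for t in range(num_frames - horizon):
        # Assumption: deltas are additive, so the net action over the window
        # is the per-dimension sum of the individual steps.
        window = actions[t:t + horizon]
        net = tuple(sum(dim) for dim in zip(*window))
        samples.append(IDMSample(t, t + horizon, net))

        if add_reversed:
            # Pseudo-reversed pair: swap the frame order and negate the label.
            # No new annotation is needed; the label is derived from `net`.
            reversed_action = tuple(-a for a in net)
            samples.append(IDMSample(t + horizon, t, reversed_action))
    return samples


# Toy trajectory: three 2-D delta actions, hence four frames (indices 0..3).
demo_actions = [(1.0, 0.0), (0.0, 2.0), (-1.0, 0.0)]
for sample in build_idm_samples(demo_actions, horizon=2):
    print(sample)
```

In a full training setup, each tuple would index into the stored observations, the two frames would pass through the VLA vision encoder, and a small head would regress `action` alongside the usual policy loss. Because the auxiliary head is only used during training, a construction like this is compatible with leaving inference unchanged, as the abstract describes.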