🤖 AI Summary
Existing vision-language-action (VLA) models suffer from spurious correlations between task-irrelevant visual features and actions, undermining cross-scenario generalization. To address this, we propose Intrinsic Spatial Reasoning (InSpire), a lightweight mechanism that enhances spatial awareness and causal robustness in VLAs—without additional data or model parameters. InSpire leverages directional spatial questioning and answer alignment, built upon a pretrained vision-language foundation model, and integrates instruction-prefix augmentation, joint spatial-answer-and-action alignment, and autoregressive action decoding. Evaluated on both simulation and real-world robotic platforms, InSpire achieves significant improvements in cross-task and cross-environment generalization, while enabling plug-and-play deployment. The code, models, and demonstration videos are publicly released.
📝 Abstract
Leveraging pretrained Vision-Language Models (VLMs) to map language instructions and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA's attention to task-relevant factors by prepending the question "In which direction is the [object] relative to the robot?" to the language instruction and aligning the answer "right/left/up/down/front/back/grasped" and the predicted actions with the ground truth. Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, requiring no extra training data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of our approach. Our code, pretrained models, and demos are publicly available at: https://Koorye.github.io/proj/Inspire.
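To make the mechanism concrete, here is a minimal sketch (not the authors' released code) of the instruction-prefixing step described above. The question template and the direction vocabulary are taken from the abstract; the helper names (`build_prompt`, `direction_label`) and the axis-to-direction heuristic are assumptions for illustration only:

```python
# Illustrative sketch of InSpire-style instruction augmentation.
# The spatial question and the answer vocabulary come from the paper's
# abstract; everything else (function names, the toy geometry) is hypothetical.

DIRECTIONS = ["right", "left", "up", "down", "front", "back", "grasped"]

def build_prompt(instruction: str, obj: str) -> str:
    """Prepend the spatial-reasoning question to the task instruction."""
    question = f"In which direction is the {obj} relative to the robot?"
    return f"{question} {instruction}"

def direction_label(obj_pos, robot_pos, grasped: bool) -> str:
    """Derive a ground-truth answer from the object's relative position
    (a toy heuristic: pick the dominant axis of the offset vector)."""
    if grasped:
        return "grasped"
    dx, dy, dz = (o - r for o, r in zip(obj_pos, robot_pos))
    offsets = (dx, dy, dz)
    axis = max(range(3), key=lambda i: abs(offsets[i]))
    # Map each axis to its (negative, positive) direction pair.
    names = [("left", "right"), ("back", "front"), ("down", "up")]
    return names[axis][offsets[axis] > 0]
```

During training, the VLA would then be supervised to emit both the answer token (one of `DIRECTIONS`) and the action sequence, aligning each with its ground truth; at inference the same prefixed prompt is used, so no extra data or models are needed.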