InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning

📅 2025-05-20

🤖 AI Summary
Existing vision-language-action (VLA) models tend to learn spurious correlations between task-irrelevant visual features and actions, which undermines generalization beyond the training data. To address this, the authors propose Intrinsic Spatial Reasoning (InSpire), a lightweight mechanism that strengthens spatial reasoning in VLAs without extra training data or interaction with other large models. InSpire builds on a pretrained vision-language model and combines three elements: a directional spatial question prepended to the instruction, joint alignment of the spatial answer and the predicted actions with the ground truth, and autoregressive action decoding. Experiments in both simulation and real-world robotic environments are reported to demonstrate its effectiveness, and the method can be used as a plugin for existing autoregressive VLAs. The code, pretrained models, and demos are publicly released.

📝 Abstract
Leveraging pretrained Vision-Language Models (VLMs) to map language instructions and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA's attention to task-relevant factors by prepending the question "In which direction is the [object] relative to the robot?" to the language instruction and aligning the answer "right/left/up/down/front/back/grasped" and the predicted actions with the ground truth. Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, requiring no extra training data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of our approach. Our code, pretrained models and demos are publicly available at: https://Koorye.github.io/proj/Inspire.
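
The prompt augmentation described in the abstract can be illustrated with a minimal sketch. It is not the authors' released code: the function names `build_prompt` and `build_target`, the placeholder action tokens, and the choice to place the direction answer before the action tokens in the target sequence are all assumptions made for illustration. Only the question wording and the seven answer labels come from the abstract.

```python
# Hypothetical sketch of InSpire-style prompt augmentation (not the authors' code).
# The question text and answer vocabulary are taken from the abstract;
# everything else (names, token format, target ordering) is assumed.

DIRECTIONS = ("right", "left", "up", "down", "front", "back", "grasped")
QUESTION_TEMPLATE = "In which direction is the {obj} relative to the robot?"


def build_prompt(instruction: str, obj: str) -> str:
    """Prepend the directional question to the language instruction."""
    question = QUESTION_TEMPLATE.format(obj=obj)
    return f"{question} {instruction}"


def build_target(answer: str, action_tokens: list[str]) -> list[str]:
    """Build the supervised target: direction answer, then action tokens.

    Assumption: an autoregressive VLA is trained on the answer and the
    actions as one sequence, so both are aligned with the ground truth.
    """
    if answer not in DIRECTIONS:
        raise ValueError(f"unknown direction answer: {answer!r}")
    return [answer] + list(action_tokens)


if __name__ == "__main__":
    prompt = build_prompt("pick up the mug", "mug")
    target = build_target("left", ["<act_12>", "<act_7>"])  # placeholder tokens
    print(prompt)
    print(target)
```

Under these assumptions, the only change relative to a standard VLA training example is the extra question in the input and the extra answer token in the target, which is consistent with the claim that no additional training data or external model is needed.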
Problem

Research questions and friction points this paper is trying to address.

VLAs spuriously correlate irrelevant visual features with actions
Existing VLAs lack spatial reasoning for task-relevant factors
How to enhance VLAs' spatial reasoning without extra training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhances VLAs with intrinsic spatial reasoning
Uses directional questions to focus attention
No extra data or large models needed