InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models tend to form spurious correlations between task-irrelevant visual features and actions, which limits generalization beyond the training data. To address this, the paper proposes Intrinsic Spatial Reasoning (InSpire), a lightweight mechanism that strengthens the spatial reasoning of VLAs without extra training data or interaction with other large models. Built on a pretrained vision-language model, InSpire prepends a directional spatial question to the language instruction, jointly aligns the spatial answer and the predicted actions with the ground truth, and decodes actions autoregressively. Evaluated in both simulation and real-world robotic environments, InSpire improves generalization and can be added as a plugin to existing autoregressive VLAs. The code, pretrained models, and demonstration videos are publicly released.
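One way to picture the joint alignment of spatial answer and actions is as a single next-token loss over a target sequence that holds the answer tokens followed by the action tokens. The sketch below is only an illustration under that assumption: the function names, the equal weighting of answer and action tokens, and the toy probabilities are all hypothetical and are not taken from the released code.

```python
import math
from typing import List, Sequence


def token_nll(probs: Sequence[float], target: int) -> float:
    """Negative log-likelihood of one target token under a predicted distribution."""
    return -math.log(probs[target])


def joint_loss(step_probs: List[Sequence[float]], target_ids: List[int]) -> float:
    """Mean NLL over the concatenated [answer tokens, action tokens] sequence.

    Assumption: answer and action tokens are weighted equally; the paper
    may use a different weighting.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one predicted distribution is needed per target token")
    total = sum(token_nll(p, t) for p, t in zip(step_probs, target_ids))
    return total / len(target_ids)


# Toy example: one answer token followed by one action token,
# each predicted over a two-token vocabulary.
example_probs = [[0.5, 0.5], [0.25, 0.75]]
example_targets = [0, 1]
example_loss = joint_loss(example_probs, example_targets)
```

Because both token types share one autoregressive sequence in this sketch, no extra head or parameters are needed; the spatial answer simply becomes part of what the model is trained to predict before the action.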

📝 Abstract

Leveraging pretrained Vision-Language Models (VLMs) to map language instructions and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA's attention to task-relevant factors by prepending the question "In which direction is the [object] relative to the robot?" to the language instruction and aligning the answer "right/left/up/down/front/back/grasped" and the predicted actions with the ground truth. Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, requiring no extra training data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of our approach. Our code, pretrained models, and demos are publicly available at: https://Koorye.github.io/proj/Inspire.
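The question-prepending step described in the abstract can be sketched in a few lines. The question template and the answer vocabulary come from the abstract; everything else here is an assumption made for illustration, including the helper names `build_prompt` and `spatial_answer`, the axis convention (+x right, +y front, +z up), and the rule of picking the dominant offset axis to produce a ground-truth label.

```python
from typing import Sequence

# Answer vocabulary as listed in the abstract.
DIRECTIONS = ("right", "left", "up", "down", "front", "back", "grasped")


def build_prompt(instruction: str, obj: str) -> str:
    """Prepend the spatial question to the original language instruction."""
    question = f"In which direction is the {obj} relative to the robot?"
    return f"{question} {instruction}"


def spatial_answer(
    obj_pos: Sequence[float],
    robot_pos: Sequence[float],
    grasped: bool = False,
) -> str:
    """Hypothetical labeling rule: pick the direction with the largest offset.

    Assumed axis convention: +x = right, +y = front, +z = up.
    Ties resolve to whichever direction appears first in `offsets`.
    """
    if grasped:
        return "grasped"
    dx, dy, dz = (o - r for o, r in zip(obj_pos, robot_pos))
    offsets = {
        "right": dx, "left": -dx,
        "front": dy, "back": -dy,
        "up": dz, "down": -dz,
    }
    return max(offsets, key=offsets.get)


prompt = build_prompt("pick up the mug", "mug")
label = spatial_answer(obj_pos=(0.5, 0.1, 0.0), robot_pos=(0.0, 0.0, 0.0))
```

In this sketch the label depends only on the object's position relative to the robot, so background clutter and other task-irrelevant features cannot affect it, which matches the abstract's stated goal of steering attention toward task-relevant factors.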

Problem

Research questions and friction points this paper is trying to address.

VLAs spuriously correlate irrelevant visual features with actions

Existing VLAs lack spatial reasoning for task-relevant factors

How to enhance VLAs' spatial reasoning without extra data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhances VLAs with intrinsic spatial reasoning

Uses directional questions to focus attention

No extra data or large models needed
