ROSA: Harnessing Robot States for Vision-Language and Action Alignment

📅 2025-06-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Visual-language-action (VLA) models face a fundamental challenge in aligning high-level, instantaneous semantic representations with low-level, temporally structured physical action spaces. Method: This paper proposes bridging this spatiotemporal gap by introducing robot state estimation—automatically inferred from sensory inputs—as an intermediary between semantic understanding and action execution. We pioneer the integration of learned robot state estimation into the VLA modeling framework, jointly fine-tuning vision-language models and designing a novel spatiotemporal alignment loss to enable co-training across simulation and real-world environments. Contribution/Results: The approach substantially reduces reliance on expert demonstration data: on multi-task robotic control benchmarks under low-data regimes, it outperforms prior methods, achieving a 37% improvement in sample efficiency and significantly enhanced generalization across tasks and environments.

Technology Category

Application Category

📝 Abstract
Vision-Language-Action (VLA) models have recently made significant advance in multi-task, end-to-end robotic control, due to the strong generalization capabilities of Vision-Language Models (VLMs). A fundamental challenge in developing such models is effectively aligning the vision-language space with the robotic action space. Existing approaches typically rely on directly fine-tuning VLMs using expert demonstrations. However, this strategy suffers from a spatio-temporal gap, resulting in considerable data inefficiency and heavy reliance on human labor. Spatially, VLMs operate within a high-level semantic space, whereas robotic actions are grounded in low-level 3D physical space; temporally, VLMs primarily interpret the present, while VLA models anticipate future actions. To overcome these challenges, we propose a novel training paradigm, ROSA, which leverages robot state estimation to improve alignment between vision-language and action spaces. By integrating robot state estimation data obtained via an automated process, ROSA enables the VLA model to gain enhanced spatial understanding and self-awareness, thereby boosting performance and generalization. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of ROSA, particularly in low-data regimes.
Problem

Research questions and friction points this paper is trying to address.

Aligning vision-language space with robotic action space
Overcoming spatio-temporal gap in VLA models
Reducing data inefficiency and human labor reliance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages robot state estimation for alignment
Integrates automated spatial understanding data
Improves performance in low-data regimes
🔎 Similar Papers
No similar papers found.