SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robotic foundation models (RFMs) generalize poorly across novel environments, tasks, and robot morphologies because they are built on 2D vision-language models (VLMs) that lack inherent 3D spatial reasoning. To address this, we propose SPEAR, a framework with two key components. First, SPEAR-VLM—a 3D-aware VLM trained on large-scale non-robotic images enriched with sparse 3D annotations—regresses 3D object coordinates from a single 2D image. Second, SPEAR-1—an end-to-end, language-instructed robotics foundation model built on SPEAR-VLM—is trained via joint multimodal alignment learning and cross-dataset behavior cloning. Trained on ~45M frames from 24 Open X-Embodiment datasets, SPEAR-1 matches or surpasses state-of-the-art models such as π₀-FAST while using roughly 20× less robot demonstration data (about 5%), yielding substantial gains in the generalization, reliability, and scalability of embodied control.
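
The summary mentions training SPEAR-VLM on ordinary images carrying sparse 3D annotations and regressing 3D coordinates from a single image. As a rough illustration of what such an annotation target can look like (not the authors' actual pipeline), a pixel with an estimated metric depth can be lifted to a 3D point through the camera intrinsics; the sketch below uses hypothetical values for the pixel, depth, and intrinsics.

```python
# Minimal sketch: turning a 2D image observation into a sparse 3D annotation.
# Assumes a metric depth estimate and known camera intrinsics; all values are
# illustrative, not from the paper.
import numpy as np

def backproject(u: float, v: float, depth: float, K: np.ndarray) -> np.ndarray:
    """Lift pixel (u, v) with metric depth into a 3D point in camera coordinates."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Example: annotate an object center detected at pixel (412, 305) at 1.8 m depth.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
point_3d = backproject(412, 305, 1.8, K)  # -> array([0.276, 0.195, 1.8])
```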

📝 Abstract
Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, SPEAR-1: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on ~45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as π₀-FAST and π₀.₅, while it uses 20× fewer robot demonstrations. This carefully-engineered training strategy unlocks new VLM capabilities and as a consequence boosts the reliability of embodied control beyond what is achievable with only robotic data. We make our model weights and 3D-annotated datasets publicly available.
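
The abstract describes enhancing a pretrained 2D VLM so that it can infer object coordinates in 3D space from a single image. One common way to let a language model emit continuous coordinates, shown below purely as a hedged sketch and not as SPEAR-VLM's actual output format, is to discretize each coordinate into special vocabulary tokens; the bin count and coordinate range here are assumptions.

```python
# Hypothetical sketch: discretizing 3D coordinates into text-like tokens so a
# 2D VLM can predict them autoregressively. Binning scheme is illustrative only.
def coord_to_tokens(xyz, lo=-2.0, hi=2.0, n_bins=256):
    """Map a 3D point (meters, camera frame) to three discrete location tokens."""
    tokens = []
    for c in xyz:
        c = min(max(c, lo), hi)                              # clamp to supported range
        bin_id = int((c - lo) / (hi - lo) * (n_bins - 1) + 0.5)
        tokens.append(f"<loc_{bin_id}>")
    return tokens

def tokens_to_coord(tokens, lo=-2.0, hi=2.0, n_bins=256):
    """Invert the binning: location tokens back to approximate metric coordinates."""
    ids = [int(t.strip("<>").split("_")[1]) for t in tokens]
    return [lo + i / (n_bins - 1) * (hi - lo) for i in ids]

print(coord_to_tokens([0.276, 0.195, 1.8]))  # ['<loc_145>', '<loc_140>', '<loc_242>']
```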
Problem

Research questions and friction points this paper is trying to address.

Robotic foundation models lack 3D spatial reasoning
Bridging 2D vision-language models to 3D embodied control with robot data alone is costly and hard to scale
Scaling robotic control requires enhanced 3D understanding from images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhances VLM with 3D understanding from 2D images
Integrates grounded 3D perception with language-instructed control (see the loss sketch after this list)
Uses 3D-annotated non-robotic data to reduce robot demonstrations
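
The summary describes training SPEAR-1 via joint multimodal alignment learning and cross-dataset behavior cloning. The sketch below illustrates one plausible form of such a joint objective: a 3D-grounding regression term on annotated non-robotic images combined with a behavior-cloning term on robot demonstrations. The specific losses and weighting are assumptions for illustration, not the paper's actual objective.

```python
# Hypothetical sketch of a joint objective mixing 3D grounding and behavior cloning.
import torch
import torch.nn.functional as F

def joint_loss(pred_coords, gt_coords, pred_actions, gt_actions, alpha=0.5):
    """Weighted sum of 3D coordinate regression and action imitation terms."""
    grounding = F.l1_loss(pred_coords, gt_coords)    # 3D grounding on annotated images
    cloning = F.mse_loss(pred_actions, gt_actions)   # behavior cloning on robot demos
    return alpha * grounding + (1.0 - alpha) * cloning
```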