RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
Current vision-language models (VLMs) exhibit limited spatial reasoning and embodied interaction capabilities due to the absence of explicit spatial structure and reference-frame annotations in their training data. To address this, we introduce RoboSpatial, the first large-scale multimodal dataset explicitly designed for robotic spatial understanding, comprising one million 2D egocentric images, 5,000 3D indoor scans, and three million fine-grained spatial relation annotations, and supporting joint modeling across egocentric, object-centric, and world-centric reference frames. We present the first end-to-end co-training framework unifying 2D vision-language modeling with 3D spatial understanding, and propose a joint spatial-relation and affordance modeling architecture to enhance scene grounding. Our approach achieves an average accuracy improvement of 12.7% over state-of-the-art methods on spatial relation prediction, affordance recognition, and robotic manipulation tasks, demonstrating the critical role of reference-frame-aware representations in embodied intelligence.
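
A minimal sketch of what one such annotation record might look like, assuming a simple schema; the field names, enum values, and 2D/3D pairing fields below are illustrative, not the dataset's published format:

```python
from dataclasses import dataclass
from enum import Enum

class ReferenceFrame(Enum):
    """The three reference frames the dataset distinguishes."""
    EGOCENTRIC = "ego"         # relative to the camera/robot viewpoint
    OBJECT_CENTRIC = "object"  # relative to the reference object's own orientation
    WORLD_CENTRIC = "world"    # relative to a fixed scene coordinate system

@dataclass
class SpatialRelationAnnotation:
    """One of the ~3M relation annotations, paired across 2D and 3D media.

    Field names are illustrative, not the dataset's published schema.
    """
    image_path: str        # 2D egocentric RGB frame
    scan_id: str           # identifier of the paired 3D indoor scan
    subject: str           # e.g. "mug"
    reference: str         # e.g. "laptop"
    relation: str          # e.g. "left_of", "behind", "on_top_of"
    frame: ReferenceFrame  # frame in which the relation is asserted
```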

📝 Abstract
Spatial understanding is a crucial capability for robots to make grounded decisions based on their environment. This foundational skill enables robots not only to perceive their surroundings but also to reason about and interact meaningfully within the world. In modern robotics, these capabilities are handled by vision-language models, which face significant challenges when applied to spatial reasoning because of their training data sources: built from general-purpose image datasets, they often lack sophisticated spatial scene understanding. For example, these datasets do not address reference frame comprehension: a spatial relationship requires clear contextual grounding, whether from an ego-centric, object-centric, or world-centric perspective, to support effective real-world interaction. To address this issue, we introduce RoboSpatial, a large-scale spatial understanding dataset consisting of real indoor and tabletop scenes captured as 3D scans and egocentric images, annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5K 3D scans, and 3M annotated spatial relationships, with paired 2D egocentric images and 3D scans making it both 2D- and 3D-ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robotics manipulation.
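
To make the reference-frame point concrete, here is a minimal sketch of how the same geometric question ("is the subject left of the reference?") can flip its answer depending on the frame it is evaluated in. The function, the convention that +x points right, and the example poses are assumptions for illustration only, not code from the paper:

```python
import numpy as np

def left_of(subject_xyz, reference_xyz, frame_rotation):
    """Return True if the subject lies to the left of the reference,
    with both points expressed in the chosen frame.

    frame_rotation is a 3x3 world-to-frame rotation: the camera pose
    for an egocentric judgment, the reference object's orientation for
    an object-centric one, or the identity for a world-centric one.
    """
    # Express the subject's offset from the reference in the chosen frame.
    offset = frame_rotation @ (np.asarray(subject_xyz) - np.asarray(reference_xyz))
    # Assumed convention: +x points right in the frame, so "left" is -x.
    return offset[0] < 0

# The same scene yields opposite answers in different frames.
mug, laptop = [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]
egocentric = np.eye(3)                        # camera frame aligned with world
object_centric = np.diag([-1.0, -1.0, 1.0])   # laptop rotated 180° to face the camera
print(left_of(mug, laptop, egocentric))       # False: mug is to the viewer's right
print(left_of(mug, laptop, object_centric))   # True: mug is on the laptop's own left
```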
Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial reasoning in vision-language models for robotics
Addressing the lack of spatial understanding in general-purpose image datasets
Improving robot perception and interaction with 2D/3D spatial data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset for spatial understanding
Combines 2D and 3D vision-language models
Rich annotated spatial relationships for robotics (see the sample sketch after this list)
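
As a rough illustration of how a paired annotation could drive 2D/3D co-training, the hypothetical sample below renders one spatial relation as a question-answer pair grounded in both an egocentric image and its paired scan. The paths, dictionary keys, and formatting helper are invented for this sketch; the paper does not specify its training format here:

```python
# Hypothetical sample: one annotated relation rendered as VQA-style
# supervision usable by a 2D VLM (from the image) and a 3D model
# (from the paired scan). All paths and keys are illustrative.
sample = {
    "image": "scenes/scene0042/frame_000153.jpg",  # 2D egocentric frame
    "scan": "scenes/scene0042/mesh.ply",           # paired 3D scan
    "question": "From the camera's viewpoint, is the mug to the left of the laptop?",
    "answer": "no",
    "frame": "egocentric",  # the reference frame the question assumes
}

def to_vqa_prompt(s: dict) -> str:
    """Format the sample as a plain question-answer prompt for co-training."""
    return f"Q: {s['question']}\nA: {s['answer']}"

print(to_vqa_prompt(sample))
```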