Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets

📅 2025-05-21
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of spatial and interaction reasoning in vision-language models (VLMs) under real-world physical manipulation. It proposes a benchmark-construction methodology for visual question answering (VQA) grounded in large-scale, real robot manipulation trajectories. The approach leverages robot proprioceptive data, such as end-effector pose, force-torque, and gripper state, to derive ground-truth answers: trajectories are segmented into manipulation phases, 3D scene and interaction properties are extracted per phase, and multiple-choice questions are synthesized from templates, yielding the Robo2VLM-1 dataset (684,710 questions, 463 scenes, 3,396 tasks). The benchmark covers three reasoning categories: spatial, goal-conditioned, and interaction reasoning. Results suggest it can both benchmark and improve VLM capabilities in embodied reasoning, transferring knowledge from physical manipulation data to visual-language understanding.
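
To make the phase-segmentation step above concrete, here is a minimal Python sketch. It assumes per-timestep gripper-aperture and force-magnitude signals; the function names, thresholds, and phase taxonomy are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of proprioception-based phase segmentation, assuming a
# trajectory with per-timestep gripper aperture and force-norm readings.
# All names (Phase, segment_phases, thresholds) are hypothetical.
from dataclasses import dataclass

import numpy as np

@dataclass
class Phase:
    name: str       # e.g. "approach", "contact", "grasp", "release"
    start: int      # first timestep index of the phase
    end: int        # last timestep index (inclusive)

def segment_phases(gripper_aperture: np.ndarray,
                   force_norm: np.ndarray,
                   closed_thresh: float = 0.1,
                   contact_thresh: float = 2.0) -> list[Phase]:
    """Label each timestep from scalar signals, then merge runs into phases."""
    labels = []
    for aperture, force in zip(gripper_aperture, force_norm):
        if aperture > closed_thresh and force < contact_thresh:
            labels.append("approach")   # open gripper, no contact yet
        elif aperture > closed_thresh:
            labels.append("contact")    # touching but not yet grasping
        elif force >= contact_thresh:
            labels.append("grasp")      # closed with load: carrying
        else:
            labels.append("release")    # closed/idle without load
    # Merge consecutive identical labels into contiguous phases.
    phases, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            phases.append(Phase(labels[start], start, t - 1))
            start = t
    return phases
```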

📝 Abstract
Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora. They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data. We explore the reverse paradigm: using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs. Given a human tele-operated robot trajectory, Robo2VLM derives ground-truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing. Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of the robot, task goal, and the target object. The properties are used to generate representative VQA queries - images with textual multiple-choice questions - based on spatial, goal-conditioned, and interaction reasoning question templates. We curate Robo2VLM-1, a large-scale in-the-wild dataset with 684,710 questions covering 463 distinct scenes and 3,396 robotic manipulation tasks from 176k real robot trajectories. Results suggest that Robo2VLM-1 can benchmark and improve VLM capabilities in spatial and interaction reasoning.
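
The "images with textual multiple-choice questions" step of the abstract amounts to template filling plus distractor sampling. The sketch below shows one way this could look; QUESTION_TEMPLATES, PHASE_CHOICES, and the output record layout are hypothetical assumptions, not Robo2VLM's released format.

```python
# Minimal sketch of templated multiple-choice question synthesis, pairing
# one frame with a question, shuffled options, and the answer index.
# Templates and distractor sets are illustrative, not the paper's.
import random

QUESTION_TEMPLATES = {
    "goal": "What is the robot trying to accomplish in this scene?",
    "interaction": "What manipulation phase is shown in this frame?",
}

PHASE_CHOICES = ["approach", "contact", "grasp", "transport", "release"]

def make_interaction_question(frame_id: str, true_phase: str,
                              num_choices: int = 4, seed: int = 0) -> dict:
    """Build one multiple-choice VQA record for a single frame."""
    rng = random.Random(seed)
    distractors = [p for p in PHASE_CHOICES if p != true_phase]
    options = rng.sample(distractors, num_choices - 1) + [true_phase]
    rng.shuffle(options)
    return {
        "image": frame_id,
        "question": QUESTION_TEMPLATES["interaction"],
        "options": options,
        "answer": options.index(true_phase),  # index of the correct choice
    }
```
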
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLMs using robot trajectory data
Generating VQA datasets from robot manipulation
Benchmarking VLM spatial and interaction reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates VQA datasets from robot trajectories
Uses non-visual sensory data for ground truth (see the sketch after this list)
Segments trajectories into manipulation phases
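
To make the non-visual ground-truth idea concrete: a spatial-relation answer can be derived by projecting proprioceptive 3D positions into the camera frame instead of annotating pixels. A minimal sketch, assuming calibrated camera intrinsics K and extrinsics (R, t) are recorded with the trajectory; the helper names are hypothetical.

```python
# Minimal sketch: derive a spatial-relation answer from 3D poses rather
# than image annotation, by comparing projected image-plane coordinates.
import numpy as np

def project(point_world: np.ndarray, K: np.ndarray,
            R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Pinhole projection of a 3D world point to pixel coordinates."""
    p_cam = R @ point_world + t   # world frame -> camera frame
    uv = K @ p_cam                # camera frame -> homogeneous pixels
    return uv[:2] / uv[2]         # perspective divide

def left_or_right(gripper_xyz, object_xyz, K, R, t) -> str:
    """Ground-truth answer for 'Is the gripper left or right of the object?'"""
    gu, _ = project(np.asarray(gripper_xyz), K, R, t)
    ou, _ = project(np.asarray(object_xyz), K, R, t)
    return "left" if gu < ou else "right"  # u increases to the image right
```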