🤖 AI Summary
Weak generalization to unseen scenes and novel tasks, driven largely by the scarcity and heterogeneity of embodied data, severely limits zero-shot robotic manipulation performance. To address this, we propose FSD, a novel framework featuring: (1) the first spatial-relation-driven intermediate representation that bridges vision-language understanding and embodied action decision-making end to end; (2) a self-consistency constraint that aligns spatial coordinates with visual features; and (3) a hierarchical embodied data pipeline that enhances cross-scene and cross-task generalization. FSD achieves state-of-the-art results on eight spatial and embodied benchmarks, including significant gains on our newly introduced VABench. In zero-shot manipulation, it attains a 54.1% success rate on SimplerEnv and an average of 72% across eight real-robot tasks, outperforming the strongest baseline by 30%.
📝 Abstract
Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, although built on top of general Vision-Language Models (VLMs), still fall short of robust zero-shot performance due to the scarcity and heterogeneity of embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validate FSD's capabilities in both "seeing" and "doing," achieving outstanding performance on 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed, more challenging benchmark VABench. We also verify FSD's zero-shot capabilities in robot manipulation, demonstrating significant improvements over baseline methods in both SimplerEnv and real-robot settings. Experimental results show that FSD achieves a 54.1% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.
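The abstract only names the self-consistency mechanism without detailing it. The sketch below shows one plausible form such a mechanism could take: an auxiliary loss that pushes the coordinates a model states as text toward the locations its visual features actually attend to for the same query. Everything here (the function name `self_consistency_loss`, the soft-argmax formulation, and the tensor shapes) is an assumption for illustration, not FSD's actual implementation.

```python
import torch
import torch.nn.functional as F

def self_consistency_loss(pred_coords, feat_map, query_emb):
    """Hypothetical self-consistency loss (not FSD's published method).

    Penalizes disagreement between the coordinates the model emits as
    text and the location its visual features attend to for the query.

    pred_coords: (B, 2) normalized (x, y) in [0, 1], parsed from the
                 model's textual output.
    feat_map:    (B, C, H, W) visual encoder feature map.
    query_emb:   (B, C) embedding of the referring expression.
    """
    B, C, H, W = feat_map.shape

    # Similarity of the query embedding to every spatial location.
    sim = torch.einsum("bc,bchw->bhw", query_emb, feat_map)
    attn = F.softmax(sim.view(B, -1), dim=-1).view(B, H, W)

    # Soft-argmax: attention-weighted expected (x, y) location.
    ys = torch.linspace(0.0, 1.0, H, device=feat_map.device)
    xs = torch.linspace(0.0, 1.0, W, device=feat_map.device)
    exp_y = (attn.sum(dim=2) * ys).sum(dim=1)          # (B,)
    exp_x = (attn.sum(dim=1) * xs).sum(dim=1)          # (B,)
    attended = torch.stack([exp_x, exp_y], dim=-1)     # (B, 2)

    # Consistency term: stated coordinates should match attended ones.
    return F.mse_loss(pred_coords, attended)
```

Under these assumptions, the soft-argmax keeps the attended location differentiable, so the consistency signal can flow back into both the coordinate prediction and the visual encoder rather than only supervising the text output.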