🤖 AI Summary
Weak generalization to unseen scenes and novel tasks, driven largely by the scarcity and heterogeneity of embodied data, severely limits zero-shot robotic manipulation performance. To address this, we propose FSD, a novel framework featuring: (1) the first spatial-relation-driven intermediate representation that bridges vision-language understanding and embodied action decision-making end to end; (2) a self-consistency constraint that aligns spatial coordinates with visual features; and (3) a hierarchical embodied data pipeline that enhances cross-scene and cross-task generalization. FSD achieves state-of-the-art results on eight spatial and embodied benchmarks, including significant gains on our newly introduced VABench. In zero-shot manipulation, it attains a 54.1% success rate on SimplerEnv and an average of 72% across eight real-robot tasks, outperforming the strongest baseline by 30%.
📝 Abstract
Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, although built on top of general Vision-Language Models (VLMs), still fall short of robust zero-shot performance due to the scarcity and heterogeneity of embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validate FSD's capabilities in both "seeing" and "doing," achieving outstanding performance on 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed, more challenging benchmark VABench. We also verify FSD's zero-shot capabilities in robot manipulation, demonstrating significant improvements over baseline methods in both SimplerEnv and real-robot settings. Experimental results show that FSD achieves a 54.1% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.
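The abstract only names the self-consistency mechanism without detailing it. The sketch below shows one plausible form such a mechanism could take: an auxiliary loss that pushes the coordinates a model states as text toward the locations its visual features actually attend to for the same query. Everything here (the function name `self_consistency_loss`, the soft-argmax formulation, and the tensor shapes) is an assumption for illustration, not FSD's actual implementation.

```python
import torch
import torch.nn.functional as F

def self_consistency_loss(pred_coords, feat_map, query_emb):
    """Hypothetical self-consistency loss (not FSD's published method).

    Penalizes disagreement between the coordinates the model emits as
    text and the location its visual features attend to for the query.

    pred_coords: (B, 2) normalized (x, y) in [0, 1], parsed from the
                 model's textual output.
    feat_map:    (B, C, H, W) visual encoder feature map.
    query_emb:   (B, C) embedding of the referring expression.
    """
    B, C, H, W = feat_map.shape

    # Similarity of the query embedding to every spatial location.
    sim = torch.einsum("bc,bchw->bhw", query_emb, feat_map)
    attn = F.softmax(sim.view(B, -1), dim=-1).view(B, H, W)

    # Soft-argmax: attention-weighted expected (x, y) location.
    ys = torch.linspace(0.0, 1.0, H, device=feat_map.device)
    xs = torch.linspace(0.0, 1.0, W, device=feat_map.device)
    exp_y = (attn.sum(dim=2) * ys).sum(dim=1)          # (B,)
    exp_x = (attn.sum(dim=1) * xs).sum(dim=1)          # (B,)
    attended = torch.stack([exp_x, exp_y], dim=-1)     # (B, 2)

    # Consistency term: stated coordinates should match attended ones.
    return F.mse_loss(pred_coords, attended)
```

Under these assumptions, the soft-argmax keeps the attended location differentiable, so the consistency signal can flow back into both the coordinate prediction and the visual encoder rather than only supervising the text output.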