Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Embodied AI faces a "seeing-to-doing" (perception-to-action) gap rooted in data scarcity and embodiment heterogeneity. Method: the paper introduces an embodied reasoning paradigm centered on *pointing* as a unified, embodiment-agnostic intermediate representation, formalizing four categories of embodied pointing capabilities. The authors construct Embodied-Points-200K, a large-scale embodied pointing dataset, and propose a two-stage reinforced fine-tuning (RFT) framework with a multi-task reward mechanism and a cross-modal alignment strategy. Built on a 3B-parameter vision-language model, the approach maps spatial understanding to action generation end to end. Results: state-of-the-art performance on 11 embodied spatial and pointing benchmarks; a 56.2% zero-shot success rate in SIMPLEREnv; and an 87.5% success rate on eight real-world XArm manipulation tasks, a 62% improvement over strong baselines, with strong robustness to visual disturbances.
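To make the pointing-centric representation concrete, the sketch below shows one way a 2D point emitted by a VLM could be lifted into a 3D reach target for an arm. This is a generic pinhole back-projection under assumed camera intrinsics and a hand-eye transform; the function name, the reach-primitive interface, and the calibration details are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def pixel_point_to_action(point_uv, depth_image, intrinsics, cam_to_base):
    """Lift a 2D pointing prediction (u, v) into a 3D reach target.

    Generic pinhole back-projection, not Embodied-R1's exact pipeline:
    the paper's action primitives and calibration are not specified here.
    """
    u, v = point_uv
    z = depth_image[v, u]                      # depth in meters at the pixel
    fx, fy, cx, cy = intrinsics
    p_cam = np.array([(u - cx) * z / fx,       # back-project into camera frame
                      (v - cy) * z / fy,
                      z,
                      1.0])
    p_base = cam_to_base @ p_cam               # homogeneous transform to robot base
    return p_base[:3]                          # 3D target for a reach primitive

# Toy usage: flat synthetic depth map, identity hand-eye transform.
depth = np.full((480, 640), 0.5)               # 0.5 m everywhere
K = (600.0, 600.0, 320.0, 240.0)               # fx, fy, cx, cy
print(pixel_point_to_action((320, 240), depth, K, np.eye(4)))  # -> [0. 0. 0.5]
```

The embodiment-agnostic appeal is that the VLM only ever emits pixel coordinates; everything robot-specific (depth, calibration, primitive execution) stays outside the model.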

📝 Abstract
Generalization in embodied AI is hindered by the "seeing-to-doing gap," which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. Drawing on a wide range of embodied and general visual reasoning datasets, we construct a large-scale dataset, Embodied-Points-200K, which supports key embodied pointing capabilities. We then train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization, achieving a 56.2% success rate in SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, representing a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
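The abstract's "specialized multi-task reward design" points to verifiable rewards computed directly from the model's text output. Below is a minimal sketch of one plausible instantiation for the pointing task, combining a format term with a point-in-region accuracy term; the 0.1/0.9 weighting, the coordinate regex, and the box-membership check are assumptions, not the paper's published reward.

```python
import re

def pointing_reward(response: str, gt_region) -> float:
    """Hedged sketch of a verifiable multi-task reward for pointing RFT.

    Combines a format term (does the output parse as a point?) with an
    accuracy term (does the point fall inside the annotated region?).
    The terms and weights are illustrative, not Embodied-R1's values.
    """
    m = re.search(r"\((\d+),\s*(\d+)\)", response)
    if m is None:
        return 0.0                              # unparseable output: no reward
    u, v = int(m.group(1)), int(m.group(2))
    x0, y0, x1, y1 = gt_region                  # ground-truth box in pixels
    in_region = x0 <= u <= x1 and y0 <= v <= y1
    return 0.1 + (0.9 if in_region else 0.0)    # format bonus + accuracy term

print(pointing_reward("The handle is at (312, 198).", (300, 180, 340, 220)))  # 1.0
print(pointing_reward("I cannot point.", (300, 180, 340, 220)))               # 0.0
```

Rewards like this are verifiable without a learned critic, which is what makes an RFT curriculum over pointing annotations practical at the 200K scale the dataset suggests.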
Problem

Research questions and friction points this paper is trying to address.

The "seeing-to-doing gap": vision-language comprehension does not translate directly into robotic action primitives
Data scarcity and embodiment heterogeneity hinder generalization in robot learning
Lack of a unified, embodiment-agnostic intermediate representation for generalizable manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pointing as unified intermediate representation for robotics
Two-stage reinforced fine-tuning curriculum with multi-task rewards (a curriculum sketch follows this list)
3B vision-language model trained on Embodied-Points-200K dataset
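A rough sketch of how the two-stage curriculum referenced above might be organized, assuming a first stage of general spatial and visual reasoning followed by a second stage of embodied pointing; the stage contents and reward labels are guesses inferred from the abstract, not the paper's exact recipe, and reuse the reward terms sketched earlier.

```python
# Hypothetical two-stage RFT curriculum. Stage names, data mixes, and
# reward labels are assumptions, not Embodied-R1's published recipe.
CURRICULUM = [
    {"stage": 1, "data": ["spatial_reasoning", "general_visual_qa"],
     "reward": "format + answer_accuracy"},
    {"stage": 2, "data": ["referring_points", "region_points",
                          "trajectory_points", "placement_points"],
     "reward": "format + point_in_region"},
]

for cfg in CURRICULUM:
    print(f"Stage {cfg['stage']}: RL fine-tune on {cfg['data']} "
          f"with reward '{cfg['reward']}'")
```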