Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
How can pretrained models acquire human-like spatial perception and high-level reasoning from limited embodied video data? This paper proposes a synergistic framework that pairs a large-scale vision-language model (VLM) for perception with a lightweight 3B language model (LM) for reasoning, trained via customized reinforcement learning (RL). It introduces a novel "think-answer logical consistency" reward mechanism that explicitly encourages slow-thinking reasoning patterns such as systematic analysis and contextual integration. Trained on only 5k embodied video samples, the framework achieves logically consistent reasoning over continuous visual observations and matches OpenAI-o1 and Gemini-2.5-pro on both in-distribution and out-of-distribution embodied spatial reasoning benchmarks. To our knowledge, this is the first work to empirically demonstrate that high-level spatial reasoning can emerge efficiently under low-resource conditions.

📝 Abstract
Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. Using Reinforcement Learning (RL) with a novel reward system considering think-answer logical consistency, the model achieves slow-thinking capabilities with limited computational resources. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models (OpenAI-o1, Gemini-2.5-pro) on both in-distribution and out-of-distribution embodied spatial reasoning tasks. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration. We further explore research questions including response length, training on VLM, strategies for reward design, and differences in model generalization after SFT (Supervised Fine-Tuning) and RL training.
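The abstract's "think-answer logical consistency" reward can be pictured as a reward function that checks both output format and whether the reasoning trace actually commits to the emitted answer. The sketch below is illustrative only: the `<think>`/`<answer>` tag format is common in RL-trained reasoning models, but the exact tags, weights, and consistency check used by Embodied-R are assumptions here, not the paper's implementation.

```python
import re

def consistency_reward(output: str, gold_answer: str) -> float:
    """Illustrative reward combining format, accuracy, and think-answer
    logical consistency. All weights (0.2 / 0.6 / 0.2) are hypothetical."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    if not (think and answer):
        return 0.0  # format gate: both tags must be present
    reward = 0.2  # format reward for well-formed output
    final = answer.group(1).strip()
    if final == gold_answer:
        reward += 0.6  # accuracy reward
    # consistency bonus: the reasoning trace should state the answer it
    # emits rather than contradict or omit it (a crude containment check
    # stands in for whatever check the paper actually uses)
    if final and final in think.group(1):
        reward += 0.2
    return reward
```

A response that reasons its way to the correct option and restates it in the answer tag earns the full reward; a malformed response earns nothing, which is what pushes the policy toward the slow-thinking format.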
Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial reasoning in pretrained models via reinforcement learning
Combining VLMs for perception and LMs for reasoning efficiently
Achieving high-level reasoning with limited computational resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples perception (large-scale VLM) from reasoning (small 3B LM)
Trains the LM with RL using a novel think-answer logical-consistency reward
Matches state-of-the-art multimodal reasoning models after training on only 5k video samples
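The division of labor above can be sketched as a two-stage pipeline: the large VLM turns each video frame into a textual observation, and the small LM reasons over the concatenated observations. This is a minimal sketch of the collaboration pattern the paper describes; the function signatures, prompt wording, and tag instructions are assumptions, not the paper's actual interface.

```python
def embodied_r_pipeline(frames, question, vlm, lm):
    """Hypothetical two-stage sketch: VLM for perception, LM for reasoning.

    `vlm` maps one frame to a textual observation; `lm` maps a text
    prompt to a response. Both are caller-supplied callables.
    """
    # Stage 1: the large-scale VLM handles perception frame by frame
    observations = [vlm(frame) for frame in frames]
    # Stage 2: the lightweight (e.g. 3B) LM reasons over the text
    prompt = (
        "Observations:\n" + "\n".join(observations) +
        f"\nQuestion: {question}\n"
        "Reason step by step inside <think> tags, then give the "
        "final answer inside <answer> tags."
    )
    return lm(prompt)
```

Keeping the reasoning stage purely textual is what lets a 3B model be trained with RL on modest hardware while the frozen VLM supplies the perception.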