EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLMs/MLLMs employed in embodied agents suffer from three fundamental bottlenecks: architectural misalignment with agent requirements, a trade-off between real-time responsiveness and inference performance, and overreliance on offline evaluations disconnected from dynamic, interactive environments. To address these, we propose EmbodiedBrain, an agent-aligned vision-language foundation model. Our method introduces: (1) an embodied-task-oriented data structure; (2) Step-GRPO, a novel training paradigm leveraging preceding steps as guiding priors; and (3) a Generative Reward Model (GRM)-driven, multi-stage, simulation-based evaluation framework. Evaluated in 7B and 32B variants, EmbodiedBrain achieves state-of-the-art performance across generalization, long-horizon planning, and end-to-end simulation benchmarks. To foster reproducibility and community advancement, we publicly release all training data, model weights, evaluation code, and a high-fidelity, challenging simulation environment.

📝 Abstract
The realization of Artificial General Intelligence (AGI) necessitates Embodied AI agents capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. However, current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations, including a significant gap between model design and agent requirements, an unavoidable trade-off between real-time latency and performance, and the use of unrealistic, offline evaluation metrics. To address these challenges, we propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes. Our framework features an agent-aligned data structure and employs a powerful training methodology that integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augmented Group Relative Policy Optimization (Step-GRPO), which boosts long-horizon task success by integrating preceding steps as Guided Precursors. Furthermore, we incorporate a comprehensive reward system, including a Generative Reward Model (GRM) accelerated at the infrastructure level, to improve training efficiency. To enable thorough validation, we establish a three-part evaluation system encompassing General, Planning, and End-to-End Simulation Benchmarks, highlighted by the proposal and open-sourcing of a novel, challenging simulation environment. Experimental results demonstrate that EmbodiedBrain achieves superior performance across all metrics, establishing a new state-of-the-art for embodied foundation models. To pave the way for the next generation of generalist embodied agents, we open-source all of our data, model weights, and evaluation methods, which are available at https://zterobot.github.io/EmbodiedBrain.github.io.
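The abstract describes Step-GRPO as a GRPO variant that conditions each rollout on verified preceding steps ("Guided Precursors") and scores completions with group-relative rewards. The sketch below illustrates only the two generic ideas involved, group-relative advantage normalization and step-augmented prompting; the function names and prompt format are assumptions for illustration, not details from the paper.

```python
import statistics


def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward against the
    mean and standard deviation of its own sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]


def build_step_augmented_prompt(task, verified_steps):
    """Hypothetical prompt builder: preceding (verified) steps are
    prepended so the policy only has to propose the next step."""
    prefix = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(verified_steps))
    return f"Task: {task}\nCompleted so far:\n{prefix}\nNext step:"
```

Within each group, rewards are compared only to sibling samples, so no learned value network is needed; the step-augmented prompt shortens the effective horizon the policy must reason over at each update.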
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between model design and agent requirements for embodied AI
Overcoming the trade-off between real-time latency and performance in embodied tasks
Addressing limitations of unrealistic offline evaluation metrics for embodied intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes EmbodiedBrain vision-language foundation model
Integrates SFT with Step-GRPO training methodology
Uses Generative Reward Model for efficient training
Authors
Ding Zou (Huazhong University of Science and Technology), Feifan Wang, Mengyu Ge, Siyuan Fan, Zongbing Zhang, Wei Chen, Lingfeng Wang, Zhongyou Hu, Wenrui Yan, Zhengwei Gao, Hao Wang, Weizhao Jin, Yu Zhang, Hainan Zhao, Mingliang Zhang, Xianxian Xi, Yaru Zhang, Wenyuan Li, Zhengguang Gao, Yurui Zhu (University of Science and Technology of China)