RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) face three key bottlenecks in long-horizon robotic manipulation: insufficient explicit planning capability, limited perception of object affordances, and an inability to predict precise manipulation trajectories. To address these, this work formally defines and models the three core capabilities of a "robotic brain" and proposes a unified end-to-end framework built upon an MLLM architecture. The framework integrates robot-specific and general-purpose multimodal data, supports long-video and high-resolution inputs, and introduces ShareRobot, the first robotic-manipulation dataset with high-precision, multi-dimensional annotations. A multi-stage training strategy enables joint modeling from abstract task instructions to concrete, executable manipulation trajectories. Evaluated on multiple robotic manipulation benchmarks, the approach achieves state-of-the-art performance, significantly improving task decomposition rationality, affordance understanding accuracy, and trajectory generation fidelity.
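The summary describes ShareRobot's three annotation dimensions (task planning, object affordance, end-effector trajectory). As a minimal sketch of how one such record could be laid out, assuming a structure the paper does not specify; every class and field name below (ShareRobotSample, AffordanceRegion, etc.) is hypothetical:

```python
# Minimal sketch of one ShareRobot-style annotation record.
# All class and field names are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AffordanceRegion:
    object_name: str                       # e.g. "kettle handle"
    bbox: Tuple[int, int, int, int]        # (x_min, y_min, x_max, y_max) in pixels

@dataclass
class ShareRobotSample:
    instruction: str                       # abstract task instruction
    subtasks: List[str]                    # planning: ordered sub-task decomposition
    affordances: List[AffordanceRegion]    # affordance perception: interactable regions
    trajectory: List[Tuple[float, float]]  # trajectory prediction: 2D end-effector waypoints

sample = ShareRobotSample(
    instruction="Pour water from the kettle into the mug",
    subtasks=["grasp kettle handle", "lift kettle", "tilt spout over mug", "set kettle down"],
    affordances=[AffordanceRegion("kettle handle", (312, 140, 388, 210))],
    trajectory=[(0.31, 0.62), (0.35, 0.48), (0.52, 0.40)],
)
```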

📝 Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. However, their application in robotic scenarios, particularly for long-horizon manipulation tasks, reveals significant limitations. These limitations arise from the current MLLMs lacking three essential robotic brain capabilities: Planning Capability, which involves decomposing complex manipulation instructions into manageable sub-tasks; Affordance Perception, the ability to recognize and interpret the affordances of interactive objects; and Trajectory Prediction, the foresight to anticipate the complete manipulation trajectory necessary for successful execution. To enhance the robotic brain's core capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. ShareRobot's diversity and accuracy have been meticulously refined by three human annotators. Building on this dataset, we developed RoboBrain, an MLLM-based model that combines robotic and general multi-modal data, utilizes a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.
Problem

Research questions and friction points this paper is trying to address.

Enhance robotic brain capabilities for long-horizon manipulation tasks.
Address limitations in Planning, Affordance Perception, and Trajectory Prediction.
Develop a unified model integrating multimodal data for robotic manipulation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed ShareRobot, a multi-dimensionally annotated dataset for robotic manipulation.
Created RoboBrain, an MLLM-based model trained with a multi-stage strategy (see the sketch below).
Enhanced robotic capabilities by combining robotic and general multimodal data.
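As a rough illustration of how the three capabilities chain from abstract to concrete, the sketch below queries a unified MLLM stage by stage. The `query` helper, the `model.generate` call, and the prompt phrasing are assumptions for illustration, not RoboBrain's actual interface:

```python
# Hypothetical staged inference over the three "robotic brain" capabilities.
# `query` and `model.generate(...)` are assumed stand-ins, not RoboBrain's real API.
def query(model, image, prompt: str) -> str:
    return model.generate(image=image, prompt=prompt)  # assumed MLLM call signature

def manipulate(model, image, instruction: str):
    # 1. Planning: decompose the abstract instruction into ordered sub-tasks.
    plan = query(model, image, f"Decompose into sub-tasks: {instruction}")
    # 2. Affordance perception: locate where the end-effector can act.
    affordance = query(model, image, f"Identify the interactable region for: {plan}")
    # 3. Trajectory prediction: produce concrete end-effector waypoints.
    trajectory = query(model, image, f"Predict end-effector waypoints given: {affordance}")
    return plan, affordance, trajectory
```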
Yuheng Ji
Institute of Automation, Chinese Academy of Sciences
Embodied AI; Computer Vision
Huajie Tan
Peking University
Embodied AI; Foundation Models
Jiayu Shi
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Beijing Academy of Artificial Intelligence
Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI)
Vision and Language
Yuan Zhang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Beijing Academy of Artificial Intelligence
Hengyuan Zhang
Ph.D. Student, University of California San Diego
Robotics; Computer Vision; Autonomous Vehicles; Sensor Fusion
Pengwei Wang
University of Calgary
Computer Science; Security
Mengdi Zhao
Beijing Academy of Artificial Intelligence
Yao Mu
The University of Hong Kong
Pengju An
Peking University
AIGC; LLM
Xinda Xue
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Beijing Academy of Artificial Intelligence
Qinghang Su
Institute of Information Engineering, Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence
Huaihai Lyu
Institute of Automation, Chinese Academy of Sciences
Multi-modal; Embodied Intelligence
Xiaolong Zheng
School of Artificial Intelligence, University of Chinese Academy of Sciences
Jiaming Liu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Beijing Academy of Artificial Intelligence
Zhongyuan Wang
Beijing Academy of Artificial Intelligence
Shanghang Zhang
Peking University
Embodied AI; Foundation Models