🤖 AI Summary
Current multimodal large language models (MLLMs) face three key bottlenecks in long-horizon robotic manipulation: insufficient explicit planning capability, limited perception of object affordances, and inability to predict precise manipulation trajectories. To address these, this work formally defines and models the three core capabilities of a "robotic brain" and proposes a unified end-to-end framework built on an MLLM architecture. The framework integrates robot-specific and general-purpose multimodal data, supports long-video and high-resolution inputs, and introduces ShareRobot, the first high-precision robotic manipulation dataset with multi-dimensional annotations. A multi-stage training strategy enables joint modeling from abstract task instructions to concrete, executable manipulation trajectories. Evaluated on multiple robotic manipulation benchmarks, the approach achieves state-of-the-art performance, significantly improving task-decomposition rationality, affordance-understanding accuracy, and trajectory-generation fidelity.
📝 Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. However, their application in robotic scenarios, particularly for long-horizon manipulation tasks, reveals significant limitations. These limitations arise because current MLLMs lack three essential robotic brain capabilities: Planning Capability, which involves decomposing complex manipulation instructions into manageable sub-tasks; Affordance Perception, the ability to recognize and interpret the affordances of interactive objects; and Trajectory Prediction, the foresight to anticipate the complete manipulation trajectory necessary for successful execution. To enhance the robotic brain's core capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. ShareRobot's diversity and accuracy have been meticulously refined by three human annotators. Building on this dataset, we develop RoboBrain, an MLLM-based model that combines robotic and general multimodal data, utilizes a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.
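The abstract names three annotation dimensions for each ShareRobot sample: task planning (sub-tasks), object affordance, and end-effector trajectory. As an illustration only, the sketch below shows how a single annotated sample could be represented; the field names, coordinate conventions, and formats are assumptions for this sketch, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AffordanceRegion:
    """Hypothetical affordance label: an interactive object part and its 2D box."""
    label: str
    bbox: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in image pixels

@dataclass
class ShareRobotSample:
    """Hypothetical record bundling the three annotation dimensions per sample."""
    instruction: str                                   # high-level task instruction
    sub_tasks: List[str] = field(default_factory=list)  # planning: ordered sub-tasks
    affordances: List[AffordanceRegion] = field(default_factory=list)  # perception
    trajectory: List[Tuple[int, int]] = field(default_factory=list)    # 2D waypoints

# Example sample for a long-horizon instruction decomposed into sub-tasks
sample = ShareRobotSample(
    instruction="put the mug on the shelf",
    sub_tasks=["locate the mug", "grasp the handle", "lift and place on the shelf"],
    affordances=[AffordanceRegion("mug handle", (120, 80, 160, 130))],
    trajectory=[(140, 105), (150, 60), (220, 40)],
)
print(len(sample.sub_tasks))   # number of planning steps
print(sample.affordances[0].label)
```

The point of the structure is the "abstract to concrete" progression the abstract describes: a free-form instruction, an ordered plan, grounded affordance regions, and finally a concrete trajectory the end-effector could follow.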