🤖 AI Summary
To address the fragmentation of perception, reasoning, and planning in embodied agents operating in complex physical environments, this paper introduces a unified embodied vision-language foundation model released at two scales (7B and 32B), tailored to spatial understanding and long-horizon decision-making, respectively. Both variants share a heterogeneous architecture that pairs a vision encoder with a large language model, and they are trained end-to-end through multi-stage training, large-scale embodied interaction data curation, and an efficient training infrastructure. The 32B variant significantly outperforms existing open- and closed-source methods on spatial reasoning benchmarks (e.g., Ego4D-Action, Habitat-Nav) and temporal planning benchmarks (e.g., BEHAVIOR, ALFRED). Crucially, it is the first unified framework to support closed-loop interaction, multi-step physically constrained planning, and task execution over horizons exceeding 100 steps, establishing a scalable foundation-model paradigm for general-purpose embodied intelligence.
📝 Abstract
We introduce RoboBrain 2.0, our latest generation of embodied vision-language foundation models, designed to unify perception, reasoning, and planning for complex embodied tasks in physical environments. It comes in two variants: a lightweight 7B model and a full-scale 32B model, each featuring a heterogeneous architecture that combines a vision encoder with a language model. Despite its compact size, RoboBrain 2.0 achieves strong performance across a wide spectrum of embodied reasoning tasks. On both spatial and temporal benchmarks, the 32B variant achieves leading results, surpassing prior open-source and proprietary models. In particular, it supports key real-world embodied AI capabilities, including spatial understanding (e.g., affordance prediction, spatial referring, trajectory forecasting) and temporal decision-making (e.g., closed-loop interaction, multi-agent long-horizon planning, and scene graph updating). This report details the model architecture, data construction, multi-stage training strategies, infrastructure, and practical applications. We hope RoboBrain 2.0 advances embodied AI research and serves as a practical step toward building generalist embodied agents. The code, checkpoints, and benchmarks are available at https://superrobobrain.github.io.