AI Summary
This work addresses the absence of a unified, physically grounded multimodal foundation model for embodied intelligence, a gap that hinders coherent perception, reasoning, and planning in real-world spatiotemporal dynamics. To this end, we propose a unified embodied foundation model that integrates four core capabilities: egocentric understanding, multi-scale spatiotemporal localization, physics-grounded reasoning, and physics-aware planning. The model family spans three scales (2B, 8B, and 30B-A3B MoE) and uses task-customized post-training strategies, enabling strong performance across diverse downstream tasks, including navigation, vision-language-action (VLA) tasks, and complex spatial reasoning. Evaluated on 20 embodied benchmarks and 8 general visual understanding benchmarks, our model outperforms existing embodied foundation models by a significant margin, demonstrating its effectiveness and adaptability as a general-purpose pretrained backbone for embodied AI.
Abstract
Despite rapid progress in multimodal foundation models, the embodied intelligence community still lacks a unified, physically grounded foundation model that integrates perception, reasoning, and planning within real-world spatiotemporal dynamics. We introduce RynnBrain, an open-source spatiotemporal foundation model for embodied intelligence. RynnBrain strengthens four core capabilities in a unified framework: comprehensive egocentric understanding, diverse spatiotemporal localization, physically grounded reasoning, and physics-aware planning. The RynnBrain family comprises three foundation model scales (2B, 8B, and 30B-A3B MoE) and four post-trained variants tailored for downstream embodied tasks (i.e., RynnBrain-Nav, RynnBrain-Plan, and RynnBrain-VLA) or complex spatial reasoning tasks (i.e., RynnBrain-CoP). In extensive evaluations on 20 embodied benchmarks and 8 general vision understanding benchmarks, the RynnBrain foundation models outperform existing embodied foundation models by a significant margin. The post-trained model suite further substantiates two key strengths of the RynnBrain foundation model: (i) enabling physically grounded reasoning and planning, and (ii) serving as a strong pretrained backbone that can be efficiently adapted to diverse embodied tasks.