RynnBrain: Open Embodied Foundation Models

📅 2026-02-13
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the absence of a unified, physically grounded multimodal foundation model for embodied intelligence, whose lack hinders coherent perception, reasoning, and planning under real-world spatiotemporal dynamics. To this end, we propose the first unified embodied foundation model architecture that integrates four core capabilities: egocentric understanding, multi-scale spatiotemporal localization, physics-grounded reasoning, and physics-aware planning. The model employs a multi-scale Mixture-of-Experts (MoE) structure (2B/8B/30B-A3B) and task-customized post-training strategies, enabling strong performance across diverse downstream tasks, including navigation, vision-language-action (VLA) tasks, and complex spatial reasoning. Evaluated on 20 embodied benchmarks and 8 general visual understanding benchmarks, our model significantly outperforms existing approaches, demonstrating its effectiveness and adaptability as a general-purpose pretrained backbone for embodied AI.

๐Ÿ“ Abstract
Despite rapid progress in multimodal foundation models, the embodied intelligence community still lacks a unified, physically grounded foundation model that integrates perception, reasoning, and planning within real-world spatiotemporal dynamics. We introduce RynnBrain, an open-source spatiotemporal foundation model for embodied intelligence. RynnBrain strengthens four core capabilities in a unified framework: comprehensive egocentric understanding, diverse spatiotemporal localization, physically grounded reasoning, and physics-aware planning. The RynnBrain family comprises three foundation model scales (2B, 8B, and 30B-A3B MoE) and four post-trained variants tailored for downstream embodied tasks (i.e., RynnBrain-Nav, RynnBrain-Plan, and RynnBrain-VLA) or complex spatial reasoning tasks (i.e., RynnBrain-CoP). In extensive evaluations on 20 embodied benchmarks and 8 general vision understanding benchmarks, our RynnBrain foundation models outperform existing embodied foundation models by a significant margin. The post-trained model suite further substantiates two key potentials of the RynnBrain foundation model: (i) enabling physically grounded reasoning and planning, and (ii) serving as a strong pretrained backbone that can be efficiently adapted to diverse embodied tasks.
Problem

Research questions and friction points this paper is trying to address.

embodied intelligence
foundation model
spatiotemporal dynamics
physically grounded reasoning
perception-reasoning-planning integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

embodied intelligence
spatiotemporal foundation model
physically grounded reasoning
physics-aware planning
egocentric understanding
Ronghao Dang
DAMO Academy, Alibaba Group
Jiayan Guo
Alibaba DAMO Academy, Peking University
LLM, MLLM, Embodied AI, Agents, Recommender System
Bohan Hou
PhD of Computer Science, Carnegie Mellon University
Machine Learning, Systems
Sicong Leng
Nanyang Technological University
Multi-modal Learning
Kehan Li
Stanford University
Xin Li
Alibaba Group
natural language processing
Jiangpin Liu
DAMO Academy, Alibaba Group
Yunxuan Mao
Zhejiang University
computer vision, robotics
Zhikai Wang
DAMO Academy, Alibaba Group
Yuqian Yuan
PhD student, Zhejiang University
Computer Vision, Machine Learning
Minghao Zhu
DAMO Academy, Alibaba Group
Xiao Lin
DAMO Academy, Alibaba Group
Yang Bai
DAMO Academy, Alibaba Group
Qian Jiang
Northeastern University
Yaxi Zhao
DAMO Academy, Alibaba Group
Minghua Zeng
DAMO Academy, Alibaba Group
Junlong Gao
DAMO Academy, Alibaba Group
Yuming Jiang
Alibaba DAMO Academy
Jun Cen
DAMO Academy, Alibaba Group
Siteng Huang
Alibaba DAMO Academy | ZJU | Westlake University
Vision-language Models, Generative Models, Embodied AI
Liuyi Wang
Tongji University
computer vision, natural language processing, artificial intelligence
Wenqiao Zhang
DAMO Academy, Alibaba Group
Chengju Liu
DAMO Academy, Alibaba Group
Jianfei Yang
Assistant Professor, Director of MARS Lab, Nanyang Technological University
Physical AI, Embodied AI, Multimodal AI
Shijian Lu
College of Computing and Data Science, NTU
Image and video analytics, computer vision, machine learning