GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open foundation models for general-purpose humanoid robots must address multi-task generalization in real-world scenarios, real-time motion generation, and cross-modal instruction understanding. This paper introduces the first dual-system Vision-Language-Action (VLA) architecture: System 2 (vision-language understanding) and System 1 (diffusion-Transformer-based action generation) are jointly trained end-to-end to unify the modeling of real robot trajectories, human demonstration videos, and synthetic heterogeneous data. Departing from conventional staged pipelines, this approach enables language-conditioned, high-fidelity, low-latency bimanual manipulation. The model significantly outperforms existing imitation learning methods across multiple simulation benchmarks and is successfully deployed on the Fourier GR-1 humanoid robot. With only a few demonstrations, it accomplishes complex tasks, demonstrating strong generalization, robustness, and data efficiency.

📝 Abstract
General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.
Problem

Research questions and friction points this paper is trying to address.

Develops a versatile foundation model for humanoid robots
Enables robots to handle real-world variability and rapidly learn new tasks
Improves performance on language-conditioned bimanual manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action model for humanoid robots
Dual-system architecture with real-time motor control
Trained on diverse datasets including real-robot trajectories
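The dual-system design summarized above pairs a slow vision-language module (System 2) that interprets the scene and instruction with a fast diffusion-transformer head (System 1) that denoises a chunk of future motor actions. A minimal sketch of that control flow is below; the function names, feature sizes, and the simplified denoising update are all illustrative placeholders, not the paper's actual architecture or scheduler.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for System 2: a vision-language backbone that
# encodes the camera image and instruction into a conditioning latent.
# (The real model uses a pretrained VLM; this is only a placeholder.)
def system2_encode(image: np.ndarray, instruction: str) -> np.ndarray:
    text_feat = np.full(8, len(instruction) % 7, dtype=float)
    img_feat = image.reshape(-1)[:8].astype(float)
    return np.concatenate([img_feat, text_feat])  # 16-d conditioning vector

# Hypothetical stand-in for System 1: a diffusion-style action head that
# starts from noise and iteratively refines a short chunk of actions,
# conditioned on the System 2 latent.
def system1_denoise(cond: np.ndarray, horizon: int = 4, action_dim: int = 2,
                    steps: int = 8) -> np.ndarray:
    actions = rng.standard_normal((horizon, action_dim))  # start from noise
    target = np.tanh(cond[:action_dim])  # toy conditional action estimate
    for _ in range(steps):
        # Each step pulls the sample toward the conditional estimate,
        # mimicking a denoising update (not the paper's actual sampler).
        actions += 0.5 * (target - actions)
    return actions

image = rng.integers(0, 255, size=(4, 4, 3))
cond = system2_encode(image, "pick up the red cup")
chunk = system1_denoise(cond)
print(chunk.shape)  # (4, 2): a short chunk of low-level actions
```

The key design point this illustrates is the frequency split: System 2 can run once per perception update, while System 1's denoising loop produces smooth action chunks at control rate.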
Authors

NVIDIA
Johan Bjorck — Cornell University; Computer Science
Fernando Castaneda
Nikita Cherniadev
Xingye Da
Runyu Ding — The University of Hong Kong; Computer Vision, Deep Learning
Linxi "Jim" Fan
Yu Fang — Honda Research Institute Japan Co., Ltd.; Human-Robot Interaction, Eye-Head Coordination, Eye Movement, Visual Perception/Cognition
Dieter Fox — University of Washington and AI2; Robotics, Artificial Intelligence, Computer Vision
Fengyuan Hu — Research Engineer, NVIDIA; Robotics, AI, ML, NLP, Cognitive Science
Spencer Huang
Joel Jang — Research Scientist, NVIDIA
Zhenyu Jiang — Research, Amazon; Computer Vision, Robotics
Jan Kautz — Vice President of Research, NVIDIA Research; Computer Vision, Machine Learning, Visual Computing
Kaushil Kundalia
Lawrence Lao
Zhiqi Li — PhD, Nanjing University; Computer Vision
Zongyu Lin — UCLA; Large Foundation Models, Pretraining, Reasoning
Kevin Lin
Guilin Liu — Research Scientist, NVIDIA; Computer Vision, Deep Learning, Generative Models
Edith Llontop
Loic Magne
Ajay Mandlekar — Research Scientist, NVIDIA; Robot Learning, Robotics, Machine Learning, Artificial Intelligence
Avnish Narayan
Soroush Nasiriany — The University of Texas at Austin; Artificial Intelligence, Machine Learning, Robotics
Scott Reed — Research Scientist, NVIDIA Research; Artificial Intelligence, Machine Learning, Deep Learning
You Liang Tan — Berkeley
Guanzhi Wang
Zu Wang
Jing Wang
Qi Wang
Jiannan Xiang — University of California, San Diego; Natural Language Processing
Yuqi Xie
Yinzhen Xu — Peking University; Computer Vision, Robotics
Zhenjia Xu — Columbia University; Robotics, Computer Vision
Seonghyeon Ye — KAIST; Machine Learning, Robot Learning
Zhiding Yu — Principal Research Scientist & Research Lead, NVIDIA Research; Computer Vision, Deep Learning
Ao Zhang — Northwestern Polytechnical University; Keyword Spotting, Automatic Speech Recognition
Hao Zhang
Yizhou Zhao
Ruijie Zheng — University of Maryland, College Park, and NVIDIA; Machine Learning, Reinforcement Learning
Yuke Zhu — The University of Texas at Austin and NVIDIA Research; Robot Learning, Computer Vision, Machine Learning, Robotics, Artificial Intelligence