Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

📅 2025-08-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the lack of a unified foundational platform for cross-morphology robotic policy learning and evaluation. We propose Genie Envisioner (GE), the first embodied intelligence platform that unifies policy learning, action decoding, and neural simulation within an instruction-driven video generation framework. Its core innovation lies in modeling the spatiotemporal and semantic dynamics of robot interaction within a structured latent space. It integrates three components: GE-Base, a large-scale instruction-conditioned video diffusion model; GE-Act, a flow-matching action decoder; and GE-Sim, an action-conditioned neural simulator. We further introduce EWMBench, an open-source benchmark for standardized evaluation of visual realism, physical consistency, and instruction alignment. The platform generates high-fidelity trajectories under minimal supervision, supports generalized control across diverse robot morphologies, and enables closed-loop policy training with scalable, standardized assessment. All code, models, and benchmarks will be released publicly.

📝 Abstract
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
Problem

Research questions and friction points this paper is trying to address.

Integrates policy learning, evaluation, and simulation in robotics
Maps latent representations to executable robotic action trajectories
Provides scalable evaluation and training for embodied intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video diffusion model for robotic interactions
Flow-matching decoder for action trajectories
Neural simulator for policy development
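
To make the flow-matching idea behind the action decoder concrete, here is a minimal sketch in Python with NumPy. It is not GE-Act's implementation: the function names (`flow_matching_targets`, `euler_sample`, `make_toy_velocity`), the linear interpolation path, the Euler integrator, and the trajectory shape (8 steps, 7 action dimensions) are all assumptions chosen for illustration. A real decoder would replace the closed-form toy velocity field with a learned network conditioned on the video model's latent features.

```python
import numpy as np


def flow_matching_targets(noise, actions, t):
    """Build one training pair for flow matching on a linear path.

    noise, actions: arrays of shape (batch, horizon, action_dim)
    t: array of shape (batch,) with values in [0, 1]
    Returns the interpolated point x_t and the velocity a network
    would be regressed onto (actions - noise).
    """
    t = t.reshape(-1, 1, 1)
    x_t = (1.0 - t) * noise + t * actions
    target_velocity = actions - noise
    return x_t, target_velocity


def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        x = x + dt * velocity_fn(x, t)
    return x


def make_toy_velocity(target_actions):
    """Hypothetical stand-in for a learned network.

    For a single known target trajectory, the velocity field induced by
    the linear path is (target - x) / (1 - t).
    """
    def velocity_fn(x, t):
        return (target_actions - x) / (1.0 - t)
    return velocity_fn


rng = np.random.default_rng(0)
horizon, action_dim = 8, 7  # assumed sizes, not taken from the paper
target = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
noise = rng.standard_normal((horizon, action_dim))

# Decode: start from Gaussian noise and follow the velocity field to t=1.
decoded = euler_sample(make_toy_velocity(target), noise, num_steps=10)
print(np.abs(decoded - target).max())
```

In this toy setting the final Euler step lands on the target trajectory, which is why a few integration steps suffice. With a learned velocity field the result is approximate, and the number of steps trades decoding speed against accuracy.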
Yue Liao
National University of Singapore
Computer Vision, Deep Learning, MLLM

Pengfei Zhou
AgiBot Genie Team

Siyuan Huang
AgiBot Genie Team

Donglin Yang
AgiBot Genie Team

Shengcong Chen
Unknown affiliation
World Model, Computer Vision, Embodied AI, Medical Image Analysis

Yuxin Jiang
AgiBot Genie Team

Yue Hu
AgiBot Genie Team

Jingbin Cai
AgiBot Genie Team

Si Liu
Fred Hutchinson Cancer Center
Genomics, Biostatistics, Anomaly Detection, Open Category Detection

Jianlan Luo
UC Berkeley, Google X
Robotics, Machine Learning, Artificial Intelligence

Liliang Chen
AgiBot Genie Team

Shuicheng Yan
NUS LV-Lab

Maoqing Yao
Google

Guanghui Ren
AgiBot Genie Team