Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

📅 2025-08-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a unified foundation platform for robot policy learning and evaluation across embodiments. It proposes Genie Envisioner (GE), a platform that unifies policy learning, action decoding, and neural simulation within a single instruction-driven video generation framework. The core idea is to model the spatial, temporal, and semantic dynamics of robot interaction in a structured latent space. Three components build on that idea: GE-Base, a large-scale instruction-conditioned video diffusion model; GE-Act, a lightweight flow-matching decoder that maps latent representations to executable action trajectories with minimal supervision; and GE-Sim, an action-conditioned neural simulator that produces high-fidelity rollouts for closed-loop policy development. The platform also includes EWMBench, a standardized benchmark suite that measures visual fidelity, physical consistency, and instruction-action alignment. The authors state that all code, models, and benchmarks will be released publicly.

📝 Abstract

We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
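The abstract describes GE-Act as a flow-matching decoder but gives no implementation details. As a rough illustration of what flow matching means in general, the sketch below uses the common straight-line formulation: a velocity field is trained to point from a noise sample toward a data sample, and sampling integrates that field from t=0 to t=1. Everything here is an assumption made for illustration, including the function names, the 8x7 action-chunk shape, and the oracle velocity field standing in for a trained network conditioned on video latents. It is not the authors' decoder.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Regression target for the velocity field along the straight-line path."""
    return x1 - x0


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    return float(np.mean((velocity_fn(x_t, t) - target_velocity(x0, x1)) ** 2))


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 7))    # hypothetical chunk: 8 timesteps, 7 action dims
actions = rng.standard_normal((8, 7))  # stand-in for a ground-truth action chunk


def oracle(x, t):
    """Ideal velocity for this single pair; a trained network would replace it."""
    return actions - noise


decoded = euler_sample(oracle, noise)
```

With the oracle field the loss is zero and Euler integration carries the noise sample onto the target actions. In a learned decoder the velocity function would be a network that also takes the conditioning signal as input, and the loss would be averaged over many noise samples, data samples, and times.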

Problem

Research questions and friction points this paper is trying to address.

Integrating policy learning, evaluation, and simulation for robotic manipulation in a single framework

Mapping latent video representations to executable robot action trajectories

Scaling evaluation and training for embodied intelligence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Video diffusion model for robotic interactions

Flow-matching decoder for action trajectories

Neural simulator for policy development
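To make the closed-loop idea concrete, here is a minimal sketch of how a policy and an action-conditioned simulator could be chained: the policy proposes an action from the current observation, the simulator predicts the next observation, and the loop repeats. The function `closed_loop_rollout` and the toy policy and simulator are hypothetical stand-ins. In the platform described above, a policy such as GE-Act and a video-generating simulator such as GE-Sim would fill those roles, and their real interfaces are not given in this summary.

```python
import numpy as np


def closed_loop_rollout(policy, simulator, first_obs, horizon):
    """Alternate policy and simulator calls; return every observation visited."""
    obs = first_obs
    trajectory = [obs]
    for _ in range(horizon):
        action = policy(obs)          # policy sees only the current observation
        obs = simulator(obs, action)  # simulator predicts the next observation
        trajectory.append(obs)
    return trajectory


# Toy stand-ins (assumptions): observations are 2-D points, actions are displacements.
target = np.array([1.0, 0.0])


def toy_policy(obs):
    """Move halfway toward the target at each step."""
    return 0.5 * (target - obs)


def toy_simulator(obs, action):
    """Apply the displacement exactly; a learned simulator would be approximate."""
    return obs + action


rollout = closed_loop_rollout(toy_policy, toy_simulator, np.zeros(2), horizon=5)
```

Because the simulator takes the policy's own actions as input, errors in the policy show up in later observations, which is what distinguishes closed-loop evaluation from scoring actions against a fixed recorded trajectory.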
