AI Summary
This work addresses the lack of a unified foundational platform for cross-morphology robotic policy learning and evaluation. We propose Genie Envisioner (GE), the first embodied intelligence platform that unifies policy learning, action decoding, and neural simulation within an instruction-driven video generation framework. Its core innovation lies in modeling the spatial, temporal, and semantic dynamics of robot interaction within a structured latent space, integrating a large-scale instruction-conditioned video diffusion model (GE-Base), a flow-matching action decoder (GE-Act), and a neural action-conditioned simulator (GE-Sim). We further introduce EWMBench, an open-source benchmark for standardized evaluation. GE generates high-fidelity trajectories under minimal supervision, attaining state-of-the-art performance in visual realism, physical consistency, and instruction alignment. It supports generalized control across diverse robot morphologies and enables closed-loop training, while providing scalable, standardized assessment capabilities. All code, models, and benchmarks will be released publicly.
Abstract
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
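To make the idea of a flow-matching decoder more concrete, here is a minimal sketch of how such a decoder could turn a latent vector into an action at inference time: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1 with Euler steps. This is an illustration of the general technique, not GE-Act's implementation. The names `decode_action` and `oracle_velocity` are hypothetical, and the "network" is replaced by a closed-form velocity for a straight-line path whose target is the latent itself, an assumption chosen so the example is self-checking.

```python
import random
from typing import Callable, List

Vector = List[float]
# velocity(x, t, latent) -> dx/dt. Hypothetical stand-in for a learned network.
VelocityField = Callable[[Vector, float, Vector], Vector]


def oracle_velocity(x: Vector, t: float, latent: Vector) -> Vector:
    """Toy velocity for the straight-line path x_t = (1 - t) * x0 + t * x1.

    Assumption: the target action x1 is simply the latent itself, so the
    exact conditional velocity is (x1 - x_t) / (1 - t). A trained decoder
    would replace this function with a neural network.
    """
    return [(target - xi) / (1.0 - t) for xi, target in zip(x, latent)]


def decode_action(
    velocity: VelocityField,
    latent: Vector,
    steps: int = 10,
    seed: int = 0,
) -> Vector:
    """Integrate dx/dt = velocity(x, t, latent) from t=0 to t=1 (Euler)."""
    rng = random.Random(seed)
    # Start from a Gaussian noise sample with the same size as the action.
    x = [rng.gauss(0.0, 1.0) for _ in latent]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity(x, t, latent)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    latent = [0.5, -1.0, 2.0]
    print(decode_action(oracle_velocity, latent))
```

With the oracle velocity, the final Euler step lands on the target, so the decoded action matches the latent up to floating-point error regardless of the noise sample. With a learned velocity field the result would only approximate the target, and the step count would trade accuracy against inference cost.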