GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of high-fidelity, scalable closed-loop video simulators for robotic manipulation. GE-Sim 2.0 builds on the action-conditioned video generation framework of Genie Envisioner and is re-trained on thousands of hours of real-world robot data. It adds three modules: a state expert that decodes proprioceptive state from video latents, a world judge that scores generated rollouts against task instructions to produce success signals and rewards, and an acceleration framework that generates a 25-frame rollout in 2.3 seconds on a single H100 GPU and supports up to 4× frame skipping at inference. Built on a 2-billion-parameter Transformer, the model ranks first on the public WorldArena leaderboard, and policies trained against its rollouts and rewards show measurable gains when deployed on real robots.

📝 Abstract

We introduce GE-Sim 2.0 (Genie Envisioner World Simulator 2.0), a closed-loop video world simulator for robotic manipulation. Building on the action-conditioned video generation framework of Genie Envisioner, GE-Sim 2.0 is re-trained on thousands of hours of real-world robot data spanning teleoperation, contact-rich interaction, and on-robot policy deployment, substantially improving action-following fidelity and trajectory coverage. On top of this foundation, three new modules close the loop from video simulation to policy learning: a state expert that decodes proprioceptive state from video latents to support next-chunk prediction by downstream VLA policies; a world judge that scores generated rollouts against task instructions, yielding machine-verifiable success signals and rewards in place of manual inspection; and an acceleration framework that delivers a 25-frame rollout in 2.3 seconds on a single H100, with up to 4× frame skipping at inference for long-horizon evaluation. GE-Sim 2.0 tops the public WorldArena leaderboard at only 2B parameters, outperforming both dedicated robotic world models and closed-source general video generators, and policies trained against its rollouts and rewards translate into measurable real-world gains, establishing GE-Sim 2.0 as a practical platform for scalable evaluation and closed-loop learning of manipulation policies.
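
The closed loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (closed_loop_rollout, the toy_* functions, success_threshold) is hypothetical, and the toy components stand in for the VLA policy, the video world model, the state expert, and the world judge using a one-dimensional "position" in place of real video latents.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Frame = List[float]   # stand-in for a generated video frame or latent (assumption)
State = List[float]   # stand-in for proprioceptive state (assumption)
Action = List[float]  # stand-in for a single robot action (assumption)


@dataclass
class RolloutResult:
    frames: List[Frame] = field(default_factory=list)
    states: List[State] = field(default_factory=list)
    reward: float = 0.0
    success: bool = False


def closed_loop_rollout(
    policy: Callable[[Frame, State, str], List[Action]],
    world_model: Callable[[Frame, List[Action]], List[Frame]],
    state_expert: Callable[[Frame], State],
    world_judge: Callable[[List[Frame], str], float],
    first_frame: Frame,
    first_state: State,
    instruction: str,
    num_chunks: int,
    success_threshold: float = 0.9,  # hypothetical cut-off, not from the paper
) -> RolloutResult:
    """Alternate policy and simulator for num_chunks action chunks, then score."""
    result = RolloutResult(frames=[first_frame], states=[first_state])
    frame, state = first_frame, first_state
    for _ in range(num_chunks):
        # 1. The policy proposes the next action chunk from observation and state.
        actions = policy(frame, state, instruction)
        # 2. The world model generates frames conditioned on those actions.
        new_frames = world_model(frame, actions)
        result.frames.extend(new_frames)
        frame = new_frames[-1]
        # 3. The state expert decodes state from the latest generated frame,
        #    so the policy can predict the next chunk without a real robot.
        state = state_expert(frame)
        result.states.append(state)
    # 4. The judge scores the whole rollout against the task instruction.
    result.reward = world_judge(result.frames, instruction)
    result.success = result.reward >= success_threshold
    return result


# Toy stand-ins: a 1-D world where the task is to reach position 1.0.
def toy_policy(frame: Frame, state: State, instruction: str) -> List[Action]:
    return [[0.25], [0.25]]  # fixed chunk of two small steps


def toy_world_model(frame: Frame, actions: List[Action]) -> List[Frame]:
    position = frame[0]
    frames = []
    for action in actions:
        position += action[0]
        frames.append([position])
    return frames


def toy_state_expert(frame: Frame) -> State:
    return list(frame)  # in this toy world the state is directly visible


def toy_world_judge(frames: List[Frame], instruction: str) -> float:
    return max(0.0, 1.0 - abs(frames[-1][0] - 1.0))
```

Calling closed_loop_rollout with the toy components, a start position of 0.0, and num_chunks=2 walks the position to 1.0 and marks the rollout as successful; with num_chunks=1 the rollout stops halfway and falls below the threshold. In a real setting the reward returned by the judge would be the signal used for policy evaluation or training.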

Problem

Research questions and friction points this paper is trying to address.

video world simulator

robotic manipulation

closed-loop learning

action-conditioned video generation

policy evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

closed-loop simulation

video world model

robotic manipulation