Yume: An Interactive World Generation Model

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating interactive, high-fidelity dynamic virtual worlds from diverse inputs (images, text, or videos) while enabling real-time exploration and control via peripherals or neural signals. Methodologically, it introduces the first camera-motion-quantized modeling framework and Masked Video Diffusion Transformer (MVDT), augmented with a training-free anti-artifact mechanism (AAM) and time-travel sampling via stochastic differential equations (TTS-SDE). To enhance temporal coherence, it integrates memory caching and adversarial distillation. Trained on the Sekai dataset, the model synthesizes infinitely extendable, physically coherent high-definition video sequences from a single input image, achieving millisecond-level interactive latency. The framework enables immersive, embodied interaction and supports brain-computer interface (BCI)-driven world generation. Code and models are publicly released, establishing a novel paradigm for embodied AI and neuro-controlled dynamic world synthesis.

📝 Abstract
Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world that allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of that world using keyboard actions. To achieve high-fidelity, interactive video world generation, we introduce a well-designed framework consisting of four main components: camera motion quantization, a video generation architecture, an advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction via keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, a training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced into the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration through the synergistic optimization of adversarial distillation and caching mechanisms. We train Yume on the high-quality world exploration dataset Sekai, and it achieves remarkable results across diverse scenes and applications. All data, the codebase, and model weights are available at https://github.com/stdstu12/YUME. Yume will be updated monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.
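The abstract describes quantizing continuous camera motion into discrete actions so that a keyboard press can drive generation. A minimal sketch of that idea is below; the action vocabulary, axis conventions, and threshold are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch: map a continuous camera delta (translation + yaw)
# to the nearest discrete, keyboard-style action token. Axis layout
# (x = lateral, z = forward) and the yaw threshold are assumed here.
ACTIONS = ["forward", "backward", "left", "right", "turn_left", "turn_right"]

def quantize_motion(delta_pos, delta_yaw, yaw_threshold=0.1):
    """Return the discrete action token closest to a continuous camera delta."""
    # Rotation dominates if the yaw change exceeds the threshold.
    if abs(delta_yaw) > yaw_threshold:
        return "turn_left" if delta_yaw > 0 else "turn_right"
    dx, dz = delta_pos[0], delta_pos[2]  # lateral and forward components
    # Otherwise pick the dominant translation axis.
    if abs(dz) >= abs(dx):
        return "forward" if dz > 0 else "backward"
    return "right" if dx > 0 else "left"

print(quantize_motion([0.0, 0.0, 1.0], 0.02))  # -> forward
```

With such a vocabulary, each user keystroke selects one token that conditions the next generated clip, which also stabilizes training by removing continuous camera-pose regression.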
Problem

Research questions and friction points this paper is trying to address.

Create an interactive, dynamic world from images, text, or videos
Enable exploration and control via peripheral devices or neural signals
Develop a framework for high-fidelity video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Camera motion quantization for stable training
Masked Video Diffusion Transformer with memory
Anti-Artifact Mechanism and Time Travel Sampling
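The "time travel" idea behind TTS-SDE can be illustrated generically: after advancing the reverse-diffusion process, periodically re-inject noise and re-denoise a few earlier steps so later steps can correct earlier errors. The sketch below shows that general resampling pattern with a stub denoiser; it is not Yume's actual TTS-SDE, and all function names and hyperparameters are assumptions.

```python
# Generic "time travel" resampling in a diffusion sampler (illustrative only).
import random

def denoise_step(x, t):
    # Stub for one reverse-diffusion step of a trained model.
    return [xi * 0.9 for xi in x]

def renoise(x, sigma):
    # Jump back in time by injecting fresh Gaussian noise.
    return [xi + sigma * random.gauss(0, 1) for xi in x]

def sample_with_time_travel(x, num_steps=10, travel_every=4, jump=2, sigma=0.1):
    t = num_steps
    while t > 0:
        x = denoise_step(x, t)
        t -= 1
        # Periodically travel back `jump` steps and re-denoise them.
        if t > jump and (num_steps - t) % travel_every == 0:
            x = renoise(x, sigma)
            for back_t in range(t + jump, t, -1):
                x = denoise_step(x, back_t)
    return x
```

The extra re-denoising passes trade sampling speed for quality, which is presumably why the paper pairs this sampler with distillation and caching for acceleration.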
Authors
Xiaofeng Mao (Alibaba Group): Computer Vision, Adversarial Machine Learning
Shaoheng Lin (Shanghai AI Laboratory)
Zhen Li (Shanghai AI Laboratory)
Chuanhao Li (Shanghai AI Laboratory)
Wenshuo Peng (Shanghai AI Laboratory)
Tong He (Shanghai AI Laboratory)
Jiangmiao Pang (Shanghai AI Laboratory)
Mingmin Chi (Fudan University): Data Science, Big Data, Remote Sensing, Finance, Machine Learning
Yu Qiao (Shanghai AI Laboratory)
Kaipeng Zhang (Shanghai AI Laboratory): LLM, Multimodal LLMs, AIGC