Yume-1.5: A Text-Controlled Interactive World Generation Model

📅 2025-12-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing diffusion-based methods for interactive 3D world generation suffer from excessive parameter counts, high inference-step requirements, and unbounded historical context growth, leading to poor real-time performance and limited fine-grained textual control. This paper introduces the first end-to-end, explorable 3D world generation framework supporting single-image or text input and keyboard-driven real-time navigation. The approach addresses these limitations through three core innovations: (1) a long-video modeling architecture integrating unified context compression with linear attention fusion; (2) a streaming inference mechanism leveraging bidirectional attention distillation and enhanced text embedding guidance; and (3) an event-level, text-guided paradigm for dynamic world evolution. Experiments demonstrate substantial reductions in model parameters and sampling steps, enabling millisecond-scale interactive response while preserving high visual fidelity and full text controllability throughout generation.
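The summary names "unified context compression with linear attention fusion" without giving equations; the sketch below only illustrates the general idea of folding an unbounded frame history into a fixed-size running state via kernelized linear attention. Every name, shape, and design choice here is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch (PyTorch): compressing a growing frame history into a
# fixed-size state with linear attention, so memory cost stays O(dim^2)
# regardless of how many frames have been generated. Not the paper's code.
import torch
import torch.nn.functional as F

class LinearAttentionMemory(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_qkv = torch.nn.Linear(dim, 3 * dim, bias=False)
        # Running history summary: size is independent of sequence length.
        self.register_buffer("state", torch.zeros(dim, dim))
        self.register_buffer("norm", torch.zeros(dim))

    @staticmethod
    def feature(x: torch.Tensor) -> torch.Tensor:
        # Positive kernel feature map (elu + 1), as in linear transformers.
        return F.elu(x) + 1.0

    @torch.no_grad()
    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (seq, dim) latent tokens of the newest frame chunk.
        q, k, v = self.to_qkv(tokens).chunk(3, dim=-1)
        q, k = self.feature(q), self.feature(k)
        # Fold the new chunk into the compressed history.
        self.state = self.state + k.transpose(0, 1) @ v   # (dim, dim)
        self.norm = self.norm + k.sum(dim=0)              # (dim,)
        # Each query reads from the entire compressed history at O(dim) cost.
        return (q @ self.state) / (q @ self.norm).clamp_min(1e-6).unsqueeze(-1)
```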

๐Ÿ“ Abstract
Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance and preclude text-controlled generation. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; and (3) a text-controlled method for generating world events. The codebase is provided in the supplementary material.
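As a rough picture of how these components could fit together at inference time, here is a hypothetical streaming loop: a distilled few-step denoiser generates one latent chunk per keyboard action, conditioned on a fixed-size compressed history and the text embedding. Every signature below (denoiser, memory.read/memory.write, the chunk shape) is an assumption, not the released API.

```python
# Hypothetical streaming rollout; illustrates few-step sampling plus a
# bounded history, not the actual Yume-1.5 interface.
import torch

@torch.no_grad()
def stream_chunks(denoiser, memory, text_emb, actions,
                  chunk_shape=(4, 16, 32, 32), steps=2):
    """Yield one latent video chunk per keyboard action.

    denoiser(x, t, context, text_emb, action) -> x   # distilled few-step model
    memory.read() / memory.write(x)                  # fixed-size history summary
    """
    for action in actions:                 # e.g. ["W", "W", "A", ...]
        x = torch.randn(chunk_shape)       # fresh noise for the next chunk
        for t in reversed(range(steps)):   # a few steps instead of 30-50
            x = denoiser(x, t, memory.read(), text_emb, action)
        memory.write(x)                    # fold the chunk into history
        yield x
```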
Problem

Research questions and friction points this paper is trying to address.

Generating interactive worlds from text prompts
Reducing model size and inference steps for real-time performance
Enabling text-controlled world event generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-video generation with unified context compression and linear attention
Real-time streaming acceleration via bidirectional attention distillation and enhanced text embedding
Text-controlled method for generating interactive world events (see the sketch after this list)
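A minimal sketch of what event-level text control could look like in practice: prompts are scheduled against chunk indices, and the active text embedding is swapped at event boundaries so a new instruction redirects the world mid-stream. The schedule format and encode_text are hypothetical, not the paper's interface.

```python
# Hypothetical event schedule: swap the conditioning embedding when an
# event fires, keeping it constant between events.
def text_schedule(events, encode_text, num_chunks):
    """events maps chunk_index -> prompt; returns one embedding per chunk."""
    assert 0 in events, "need an initial prompt for chunk 0"
    embs, current = [], None
    for i in range(num_chunks):
        if i in events:          # an event fires: re-encode the new prompt
            current = encode_text(events[i])
        embs.append(current)     # embedding persists until the next event
    return embs

# Usage: the scene changes at chunk 8, e.g.
# embs = text_schedule({0: "a sunny street", 8: "heavy rain begins"},
#                      encode_text=my_text_encoder, num_chunks=16)
```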
Authors
Xiaofeng Mao, Alibaba Group (Computer Vision, Adversarial Machine Learning)
Zhen Li, Shanghai AI Laboratory
Chuanhao Li, Shanghai AI Laboratory
Xiaojie Xu, Shanghai AI Laboratory
Kaining Ying, Fudan University
Tong He, Shanghai AI Laboratory
Jiangmiao Pang, Shanghai AI Laboratory
Yu Qiao, Shanghai AI Laboratory
Kaipeng Zhang, Shanghai AI Laboratory (LLM, Multimodal LLMs, AIGC)