Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model

📅 2025-08-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing interactive world models rely on bidirectional attention and many-step inference, which makes them too slow to respond instantly to a changing environment. This work introduces the first interactive world model capable of real-time streaming generation: it combines a causal attention architecture, frame-level action injection, and few-step autoregressive diffusion distillation to drastically reduce the number of inference steps while preserving physical plausibility. The authors build a scalable data pipeline on Unreal Engine and GTA5 that produces about 1,200 hours of video with fine-grained action annotations. The open-sourced model generates long interactive videos (minute-scale, high-fidelity output) at 25 FPS with millisecond-level action-to-response latency, the first such demonstration. The work thereby eases the longstanding trade-off among real-time capability, physical fidelity, and deployment feasibility in interactive world modeling.

📝 Abstract
Recent advances in interactive video generation have demonstrated the potential of diffusion models as world models that capture complex physical dynamics and interactive behaviors. However, existing interactive world models depend on bidirectional attention and lengthy inference steps, which severely limits real-time performance. Consequently, they struggle to simulate real-world dynamics, where outcomes must update instantaneously based on historical context and current actions. To address this, we present Matrix-Game 2.0, an interactive world model that generates long videos on the fly via few-step auto-regressive diffusion. Our framework consists of three key components: (1) a scalable data production pipeline for Unreal Engine and GTA5 environments that produces large amounts (about 1,200 hours) of video data with diverse interaction annotations; (2) an action injection module that accepts frame-level mouse and keyboard inputs as interactive conditions; (3) few-step distillation based on the causal architecture for real-time, streaming video generation. Matrix-Game 2.0 can generate high-quality minute-level videos across diverse scenes at 25 FPS. We open-source our model weights and codebase to advance research in interactive world modeling.
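The contrast between bidirectional and causal attention can be made concrete with a small mask. The following is a minimal sketch, assuming each frame contributes a fixed number of tokens and that causality is enforced at frame granularity. The function name and parameters are hypothetical and are not taken from the Matrix-Game 2.0 codebase.

```python
from typing import List


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumption for illustration: tokens inside one frame attend to each other
    freely, and no token attends to a token from a later frame.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        # A key is visible if its frame index is not later than the query's.
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # 'x' marks an allowed query/key pair, '.' a masked one.
    for row in block_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With a bidirectional mask every entry would be allowed, so generating frame t would require future frames to exist already. The block-causal form lets each new frame be produced from past frames only, which is what makes streaming possible.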
Problem

Research questions and friction points this paper is trying to address.

Real-time interactive video generation with minimal latency
Overcoming slow inference in diffusion-based world models
Enabling dynamic, action-responsive streaming video synthesis
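To see why the number of diffusion steps matters for real-time use, a rough time budget helps. The sketch below assumes that frames are denoised in chunks and that each denoising step costs the same amount of time. The helper name and the step counts in the usage example are illustrative assumptions, not figures reported by the paper; only the 25 FPS target comes from the source.

```python
def per_step_budget_ms(target_fps: float, denoise_steps: int, frames_per_chunk: int = 1) -> float:
    """Time available for one denoising step, in milliseconds.

    Simplified model: a chunk of `frames_per_chunk` frames must be finished
    within the time it takes to display those frames at `target_fps`.
    """
    if target_fps <= 0 or denoise_steps <= 0 or frames_per_chunk <= 0:
        raise ValueError("all arguments must be positive")
    chunk_budget_ms = 1000.0 * frames_per_chunk / target_fps
    return chunk_budget_ms / denoise_steps


if __name__ == "__main__":
    # Hypothetical step counts, chosen only to show the scale of the difference.
    for steps in (4, 50):
        print(steps, "steps:", per_step_budget_ms(25, steps), "ms per step")
```

At 25 FPS a single frame has a 40 ms budget, so every extra denoising step cuts directly into the time available for each network evaluation.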
Innovation

Methods, ideas, or system contributions that make the work stand out.

Few-step auto-regressive diffusion for real-time video
Scalable data pipeline with diverse interaction annotations
Action injection module for frame-level inputs
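The way these pieces fit together at inference time can be sketched as a streaming loop: for each incoming action, run a small fixed number of denoising steps conditioned on a bounded window of past frames, then emit the frame immediately. This is a toy illustration under stated assumptions, not the authors' implementation. `toy_denoiser`, `stream_frames`, and all parameters are hypothetical, frames and actions are plain lists of floats of equal length, and the loop starts from zeros rather than Gaussian noise so that it stays deterministic.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List

Frame = List[float]   # stand-in for a latent video frame
Action = List[float]  # stand-in for encoded keyboard/mouse input


def toy_denoiser(noisy: Frame, context: List[Frame], action: Action, step: int) -> Frame:
    """Hypothetical placeholder for a distilled few-step denoiser.

    It moves the noisy frame halfway toward (previous frame + action), purely
    so the loop has something deterministic to run. `step` is unused here.
    """
    prev = context[-1] if context else [0.0] * len(noisy)
    target = [p + a for p, a in zip(prev, action)]
    return [n + 0.5 * (t - n) for n, t in zip(noisy, target)]


def stream_frames(
    actions: Iterable[Action],
    denoiser: Callable[[Frame, List[Frame], Action, int], Frame],
    frame_dim: int,
    num_steps: int = 4,
    context_len: int = 8,
) -> Iterator[Frame]:
    """Yield one frame per action, using only past frames as context."""
    context: Deque[Frame] = deque(maxlen=context_len)  # bounded history window
    for action in actions:
        frame = [0.0] * frame_dim  # a real sampler would start from noise
        for step in range(num_steps):
            # The action is injected at every step of the current frame.
            frame = denoiser(frame, list(context), action, step)
        context.append(frame)
        yield frame  # emitted before the next action is read


if __name__ == "__main__":
    inputs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    for i, f in enumerate(stream_frames(inputs, toy_denoiser, frame_dim=2)):
        print(i, f)
```

Because `stream_frames` is a generator, each frame is available to the caller before the next action is consumed, mirroring the action-then-response pattern that interactive use requires.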