Matrix-Game: Interactive World Foundation Model

📅 2025-06-23
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses precise control over agent actions and camera motion in controllable game-world video generation, where existing methods often compromise visual quality and temporal coherence. We propose a two-stage interactive video generation framework: (1) pretraining on over 2,700 hours of unlabeled Minecraft gameplay video, followed by (2) training on over 1,000 hours of clips annotated with keyboard and mouse actions to enable fine-grained action-conditioned modeling. The model, with over 17 billion parameters, jointly conditions on a reference image, motion context, and user actions. We introduce GameWorld Score, a unified benchmark that measures visual quality, temporal quality, action controllability, and physical rule understanding. Our method consistently outperforms prior open-source baselines, including Oasis and MineWorld, across all metrics, with the largest gains in controllability and physical consistency. Double-blind human evaluation further confirms that the generated videos are perceptually realistic and precisely controllable.

📝 Abstract
We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.
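The abstract names three conditioning signals: a reference image, motion context, and user actions. The sketch below illustrates one way such inputs could be bundled and fed to a generator chunk by chunk. It is not the Matrix-Game implementation: every name (`UserAction`, `Conditioning`, `build_conditioning`, `rollout`, `repeat_last_frame`), the toy `Frame` type, the `context_len` parameter, and the chunk-wise loop itself are assumptions made for illustration, and the "generator" is a stub that only repeats the last frame.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Hypothetical toy types; none of these names come from the Matrix-Game codebase.
Frame = List[List[int]]  # stand-in for an image: rows of pixel values


@dataclass(frozen=True)
class UserAction:
    keys: FrozenSet[str] = frozenset()             # keyboard keys held during this step
    mouse_delta: Tuple[float, float] = (0.0, 0.0)  # (dx, dy) mouse movement for the camera


@dataclass
class Conditioning:
    reference_image: Frame
    motion_context: List[Frame]  # most recent frames seen so far
    actions: List[UserAction]    # one action per frame to generate


def build_conditioning(reference_image: Frame, history: Sequence[Frame],
                       actions: Sequence[UserAction], context_len: int = 4) -> Conditioning:
    """Bundle the three signals, keeping only the last `context_len` frames as motion context."""
    return Conditioning(reference_image, list(history[-context_len:]), list(actions))


def rollout(generate_chunk: Callable[[Conditioning], List[Frame]], reference_image: Frame,
            action_chunks: Sequence[Sequence[UserAction]], context_len: int = 4) -> List[Frame]:
    """Assumed chunk-wise loop: each chunk is conditioned on frames generated before it."""
    history: List[Frame] = [reference_image]
    for actions in action_chunks:
        cond = build_conditioning(reference_image, history, actions, context_len)
        history.extend(generate_chunk(cond))
    return history[1:]  # drop the reference image, return generated frames only


def repeat_last_frame(cond: Conditioning) -> List[Frame]:
    """Stub generator: emits one copy of the latest context frame per action."""
    return [cond.motion_context[-1] for _ in cond.actions]


reference = [[0, 0], [0, 0]]
action_chunks = [
    [UserAction(keys=frozenset({"W"}))] * 3,   # hold "W" for three frames
    [UserAction(mouse_delta=(5.0, 0.0))] * 2,  # move the mouse right for two frames
]
frames = rollout(repeat_last_frame, reference, action_chunks)
print(len(frames))  # 5
```

Swapping `repeat_last_frame` for a real model call is the only change the loop would need; the point of the sketch is the shape of the inputs, not the generation itself.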
Problem

Research questions and friction points this paper is trying to address.

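Controlling character actions and camera movement in generated game worlds while maintaining visual quality and temporal coherence
Training an interactive video generator from unlabeled gameplay video together with action-labeled clips
Evaluating Minecraft world generation with a single benchmark that covers visual quality, temporal quality, action controllability, and physical rule understanding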
Innovation

Methods, ideas, or system contributions that make the work stand out.

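Two-stage training pipeline: large-scale unlabeled pretraining for environment understanding, then action-labeled training for interactive video generation
Controllable image-to-world generation with over 17 billion parameters, conditioned on a reference image, motion context, and user actions
GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding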
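GameWorld Score is described as measuring four dimensions, and the abstract reports that Matrix-Game wins on all of them. A minimal sketch of how per-model results on those four dimensions could be recorded and compared is shown below. The class and function names are hypothetical, the sketch assumes higher is better on every dimension, and the numbers are placeholders rather than results from the paper; the benchmark's real metrics and any aggregation it uses are not specified here.

```python
from dataclasses import astuple, dataclass, fields
from typing import List


@dataclass(frozen=True)
class GameWorldResult:
    """Hypothetical container for the four dimensions named in the abstract."""
    visual_quality: float
    temporal_quality: float
    action_controllability: float
    physical_rule_understanding: float


def outperforms_on_all(candidate: GameWorldResult, baseline: GameWorldResult) -> bool:
    """True only if the candidate is strictly better on every dimension (assumes higher is better)."""
    return all(c > b for c, b in zip(astuple(candidate), astuple(baseline)))


def dimensions_won(candidate: GameWorldResult, baseline: GameWorldResult) -> List[str]:
    """Names of the dimensions where the candidate scores strictly higher."""
    return [f.name for f in fields(candidate)
            if getattr(candidate, f.name) > getattr(baseline, f.name)]


# Placeholder values for two unnamed models; these are NOT results from the paper.
model_a = GameWorldResult(0.9, 0.8, 0.7, 0.6)
model_b = GameWorldResult(0.8, 0.9, 0.5, 0.5)

print(outperforms_on_all(model_a, model_b))  # False
print(dimensions_won(model_a, model_b))
```

Comparing per dimension rather than through a single averaged number keeps an "outperforms across all metrics" claim checkable without inventing weights for the four dimensions.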
Authors
Yifan Zhang
Skywork AI
Chunli Peng
Skywork AI
Boyang Wang
Skywork AI
Puyi Wang
CUHK CSE PhD
Qingcheng Zhu
Skywork AI
Fei Kang
Department of Nuclear Medicine, Xijing Hospital
Lung Cancer PET/CT Imaging, Multimodality Molecular Imaging, Optical Imaging
Biao Jiang
Peking University
Computer vision
Zedong Gao
Skywork AI
Eric Li
Skywork AI
Yang Liu
Skywork AI
Yahui Zhou
Skywork AI