🤖 AI Summary
This work addresses the challenge of precise control over character actions and camera motion in controllable game-world video generation, a setting in which existing methods often compromise visual quality and temporal coherence. We propose Matrix-Game, an interactive world model trained in two stages: (1) pretraining on over 2,700 hours of unlabeled Minecraft gameplay video for environment understanding, followed by (2) training on over 1,000 hours of clips annotated with fine-grained keyboard and mouse actions to enable action-conditioned video generation. The model has over 17 billion parameters and is conditioned jointly on a reference image, motion context, and user actions. We introduce GameWorld Score, a unified benchmark that measures visual quality, temporal quality, action controllability, and physical rule understanding. Matrix-Game outperforms prior open-source Minecraft world models, including Oasis and MineWorld, across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm that it generates perceptually realistic and precisely controllable videos across diverse game scenarios.
📝 Abstract
We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.