Solaris: Building a Multiplayer Video World Model in Minecraft

📅 2026-02-25
🤖 AI Summary
This work addresses a limitation of existing video world models, which are predominantly confined to single-agent perspectives and struggle to capture multi-agent interactions and cross-view consistency in realistic scenarios. The authors propose Solaris, the first video world model to support collaborative multiplayer environments, enabled by a custom Minecraft-based multi-agent data collection system that synchronizes video observations with agent actions. A staged training strategy progressively scales modeling from single-player to multiplayer settings. Solaris introduces novel mechanisms such as Checkpointed Self Forcing and combines bidirectional, causal, and self-forcing training paradigms to efficiently model long-horizon sequences. Trained on 12.64 million frames of multiplayer interaction data, Solaris substantially outperforms current baselines on multi-view consistency and complex interactive tasks. The code and model are publicly released.

📝 Abstract
Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized video-and-action capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.
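The abstract's "Checkpointed Self Forcing" pairs two ideas: Self Forcing (training an autoregressive model on its own rollouts rather than ground-truth frames) and activation checkpointing (recomputing per-step activations in the backward pass instead of caching them, so memory no longer grows with rollout length). The paper does not publish its implementation here, so the sketch below is a hedged illustration of that combination in PyTorch; `ToyStepModel` and `self_forcing_rollout` are hypothetical names, not the authors' code.

```python
# Illustrative sketch only: a one-step "next frame" predictor rolled out
# on its own outputs, with each step wrapped in gradient checkpointing.
# This is an assumption about how a memory-efficient Self Forcing variant
# could be structured, not the paper's actual implementation.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class ToyStepModel(nn.Module):
    """Stand-in for a one-step video predictor: next frame from (frame, action)."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.net = nn.Linear(dim + dim, dim)

    def forward(self, frame, action):
        return torch.tanh(self.net(torch.cat([frame, action], dim=-1)))


def self_forcing_rollout(model, first_frame, actions):
    """Autoregressive rollout on the model's own outputs.

    Each step is wrapped in torch.utils.checkpoint, so its activations are
    recomputed during backward instead of stored. Activation memory then
    stays roughly constant in the rollout length, which is what makes a
    longer-horizon teacher affordable.
    """
    frames = [first_frame]
    for action in actions:
        nxt = checkpoint(model, frames[-1], action, use_reentrant=False)
        frames.append(nxt)
    return torch.stack(frames, dim=1)  # (batch, horizon + 1, dim)


model = ToyStepModel()
x0 = torch.randn(2, 8, requires_grad=True)
acts = [torch.randn(2, 8) for _ in range(16)]
video = self_forcing_rollout(model, x0, acts)
video.mean().backward()  # gradients flow through all checkpointed steps
```

The trade-off is the standard checkpointing one: roughly one extra forward pass of compute per step in exchange for not storing intermediate activations across the whole rollout.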
Problem

Research questions and friction points this paper is trying to address.

multi-agent interaction
video world models
multiplayer video generation
multi-view consistency
action-conditioned video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multiplayer video world model
multi-agent data collection
Checkpointed Self Forcing
view consistency
staged training pipeline
Georgy Savva
New York University
Oscar Michel
NYU
Deep Learning, Computer Vision, Computer Graphics
Daohan Lu
New York University
Suppakit Waiwitlikhit
New York University
Timothy Meehan
New York University
Dhairya Mishra
New York University
Srivats Poddar
New York University
Jack Lu
New York University
Machine Learning, Deep Learning, Generative Modeling
Saining Xie
Courant Institute, New York University
Computer Vision, Machine Learning, Representation Learning, Artificial Intelligence