$\text{M}^{\text{3}}$: A Modular World Model over Streams of Tokens

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Token-based world models have worked well in visual environments with discrete actions, but their applicability beyond that setting is uncertain. This paper proposes $\text{M}^{\text{3}}$, a modular multimodal world model that tokenizes each modality (e.g., visual observations and actions) with its own independent, modality-specific component and models dynamics over the resulting token streams without planning. On Atari 100K, it is the first planning-free world model to reach a human-level median score, and it exceeds human performance on 13 games. It also achieves state-of-the-art sample efficiency among planning-free world models across diverse benchmarks. The source code and pretrained weights are publicly released.

📝 Abstract
Token-based world models emerged as a promising modular framework, modeling dynamics over token streams while optimizing tokenization separately. While successful in visual environments with discrete actions (e.g., Atari games), their broader applicability remains uncertain. In this paper, we introduce $\text{M}^{\text{3}}$, a $\textbf{m}$odular $\textbf{w}$orld $\textbf{m}$odel that extends this framework, enabling flexible combinations of observation and action modalities through independent modality-specific components. $\text{M}^{\text{3}}$ integrates several improvements from existing literature to enhance agent performance. Through extensive empirical evaluation across diverse benchmarks, $\text{M}^{\text{3}}$ achieves state-of-the-art sample efficiency for planning-free world models. Notably, among these methods, it is the first to reach a human-level median score on Atari 100K, with superhuman performance on 13 games. We $\href{https://github.com/leor-c/M3}{\text{open-source our code and weights}}$.
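The "human-level median score" above refers to the human-normalized score commonly used for Atari 100K, (agent - random) / (human - random), where 1.0 means parity with the human reference. The sketch below shows how a median and a count of superhuman games could be derived from per-game scores. The game names and numbers are made-up placeholders, not results from the paper, and the helper names are hypothetical.

```python
from statistics import median


def human_normalized_score(agent: float, random_score: float, human: float) -> float:
    """Standard human-normalized score: 0.0 = random policy, 1.0 = human reference."""
    return (agent - random_score) / (human - random_score)


# Placeholder per-game raw scores (NOT taken from the paper).
scores = {
    "game_a": {"agent": 150.0, "random_score": 50.0, "human": 250.0},
    "game_b": {"agent": 30.0, "random_score": 10.0, "human": 20.0},
    "game_c": {"agent": 100.0, "random_score": 0.0, "human": 100.0},
}


def summarize(per_game: dict) -> dict:
    """Return the median normalized score and the games scoring above human level."""
    hns = {game: human_normalized_score(**s) for game, s in per_game.items()}
    return {
        "median_hns": median(hns.values()),
        "superhuman_games": sorted(g for g, v in hns.items() if v > 1.0),
    }


summary = summarize(scores)
print(summary)
```

A median of at least 1.0 corresponds to a human-level median score; each game with a normalized score above 1.0 counts as superhuman.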
Problem

Research questions and friction points this paper is trying to address.

Token-based world models are proven mainly in visual environments with discrete actions
Combining observation and action modalities flexibly within one framework
Improving sample efficiency of planning-free world models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular world model
Flexible modality combinations
State-of-the-art sample efficiency among planning-free world models
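As a rough illustration of what "independent modality-specific components" could look like, the sketch below keeps one tokenizer per modality in a registry and concatenates their outputs into a single token stream. Everything here is an assumption made for illustration: the modality names, the uniform quantizer, the vocabulary offsets, and the function names are hypothetical and are not taken from the paper or its released code.

```python
from typing import Callable, Dict, List, Sequence

Tokenizer = Callable[[Sequence[float]], List[int]]


def make_uniform_quantizer(low: float, high: float, bins: int, offset: int) -> Tokenizer:
    """Hypothetical tokenizer: map each continuous value to one of `bins` token ids."""
    def tokenize(values: Sequence[float]) -> List[int]:
        tokens = []
        for v in values:
            clipped = min(max(v, low), high)
            idx = min(int((clipped - low) / (high - low) * bins), bins - 1)
            tokens.append(offset + idx)
        return tokens
    return tokenize


def make_discrete_tokenizer(offset: int) -> Tokenizer:
    """Hypothetical tokenizer: shift discrete ids into their own vocabulary range."""
    def tokenize(values: Sequence[float]) -> List[int]:
        return [offset + int(v) for v in values]
    return tokenize


# One independent component per modality; adding a modality means adding an entry.
TOKENIZERS: Dict[str, Tokenizer] = {
    "vector_obs": make_uniform_quantizer(-1.0, 1.0, bins=4, offset=0),
    "continuous_action": make_uniform_quantizer(-1.0, 1.0, bins=4, offset=4),
    "discrete_action": make_discrete_tokenizer(offset=8),
}


def tokenize_step(step: Dict[str, Sequence[float]]) -> List[int]:
    """Tokenize each modality present in a step and concatenate into one stream."""
    stream: List[int] = []
    for name, values in step.items():
        stream.extend(TOKENIZERS[name](values))
    return stream


stream = tokenize_step({"vector_obs": [-1.0, 0.0, 1.0], "discrete_action": [2]})
print(stream)
```

The point of the registry is that the sequence model only ever sees token ids, so environments with different mixes of observation and action modalities can reuse it unchanged.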