Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates the impact of model scale on data-efficient generalist Transformer world models evaluated on the Atari 100k benchmark, disentangling the effects of scale from architectural mechanisms. Using a minimalist Transformer-based world model trained on fixed offline datasets, the study finds that Atari environments fall into distinct scaling regimes: in some, larger models improve monotonically, while in others larger models degrade world-model fidelity. It further shows that joint training of a single model across 26 environments stabilizes scaling, yielding monotonic gains in every environment regardless of its single-task regime. Policies trained entirely within the learned simulated dynamics achieve a median expert-random-normalized score of 0.770.

📝 Abstract

Developing generalist systems that retain human-like data efficiency is a central challenge. While world models (WMs) offer a promising path, existing research often conflates architectural mechanisms with the independent impact of model scale. In this work, we use a minimalist transformer world model to analyze scaling behaviors on the Atari 100k benchmark, using fixed offline datasets derived from a presupposed expert policy. Our results reveal that environments fundamentally fall into distinct scaling regimes, even when constrained by identical offline data budgets and model capacities. For individual tasks, some environments naturally allow models to pass the interpolation threshold, yielding monotonic improvements in the overparameterized regime, while others remain trapped in the classical regime, where larger world models degrade fidelity. In the unified setting, i.e., a single transformer trained on a suite of 26 Atari environments, we uncover that joint training stabilizes scaling dynamics, ensuring monotonic gains across all environments, regardless of their distinct inherent scaling regimes. Finally, we demonstrate that improved fidelity translates directly to downstream control, with policies learned entirely within the simulated dynamics achieving a median expert-random-normalized score of 0.770. Our findings suggest that future progress lies as much in precise scaling strategies as in architectural innovation.
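
The abstract reports a median expert-random-normalized score but does not spell out the formula. The sketch below assumes it mirrors the usual human-normalized score, with the expert policy's return taking the place of the human reference: (agent - random) / (expert - random), aggregated by the median over environments. The function names, environment names, and all numbers are hypothetical and are not taken from the paper.

```python
from statistics import median


def normalized_score(agent: float, random: float, expert: float) -> float:
    """Assumed normalization: 0.0 matches the random policy, 1.0 matches the expert."""
    if expert == random:
        raise ValueError("expert and random returns must differ")
    return (agent - random) / (expert - random)


def median_normalized_score(results: dict[str, tuple[float, float, float]]) -> float:
    """Median over environments of the per-environment normalized score.

    `results` maps an environment name to (agent, random, expert) returns.
    """
    return median(normalized_score(*returns) for returns in results.values())


# Made-up returns for three placeholder environments (illustration only).
example = {
    "env_a": (50.0, 0.0, 100.0),   # normalized: 0.50
    "env_b": (35.0, 10.0, 110.0),  # normalized: 0.25
    "env_c": (9.0, 1.0, 9.0),      # normalized: 1.00
}

print(median_normalized_score(example))
```

Under this assumed definition, a median of 0.770 would mean that in half of the environments the policy recovers at least 77% of the gap between random and expert returns.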

Problem

Research questions and friction points this paper is trying to address.

world models

model scale

data efficiency

Atari benchmark

scaling regimes

Innovation

Methods, ideas, or system contributions that make the work stand out.

world models

model scaling

data efficiency