Pixels to Play: A Foundation Model for 3D Gameplay

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of building generalist agents for 3D video games by proposing the first end-to-end trained foundation model that operates directly on raw pixel inputs, enabling cross-game, low-latency, human-behavior-guided control. Methodologically, it employs a decoder-only Transformer architecture jointly optimized for behavior cloning and inverse dynamics modeling, leveraging both labeled gameplay data and large-scale unlabeled game videos to autoregressively generate action sequences. Key contributions include: (1) achieving playable cross-platform performance (e.g., Roblox and MS-DOS) without environment interaction or reinforcement learning—relying solely on visual observations; (2) empirically validating the utility of unlabeled game videos for action inference; and (3) establishing a foundation for text-conditioned expert-level control. Experiments demonstrate applicability to emerging use cases such as AI teammates and controllable non-player characters (NPCs).

📝 Abstract
We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases - AI teammates, controllable NPCs, personalized live-streamers, assistive testers - we argue that an agent must rely on the same pixel stream available to players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human gameplay are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with autoregressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. We report qualitative results showing competent play across simple Roblox and classic MS-DOS titles, present ablations on unlabeled data, and outline the scaling and evaluation steps required to reach expert-level, text-conditioned control.
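The abstract notes that autoregressive action output keeps a large action space tractable: instead of predicting one token from the full joint space, the model picks each action component conditioned on the components already chosen. A minimal sketch of that decoding loop, with purely illustrative component names and a toy policy standing in for the transformer head:

```python
# Hedged sketch of autoregressive decoding over a factored action space.
# ACTION_COMPONENTS, toy_policy, and the dict-based interface are
# assumptions for illustration, not the paper's actual implementation.

ACTION_COMPONENTS = ["move", "camera", "button"]  # hypothetical factors

def decode_action(policy, observation):
    """Choose each action component conditioned on earlier components."""
    action = {}
    for component in ACTION_COMPONENTS:
        action[component] = policy(observation, component, dict(action))
    return action

# Toy deterministic policy: the camera choice depends on the chosen move,
# demonstrating why decoding must be sequential rather than independent.
def toy_policy(obs, component, partial):
    if component == "move":
        return "forward"
    if component == "camera":
        return "center" if partial["move"] == "forward" else "left"
    return "none"

act = decode_action(toy_policy, observation=None)
# act == {"move": "forward", "camera": "center", "button": "none"}
```

Because each component is a small categorical choice, one forward pass per component keeps per-step latency low even when the joint action space is combinatorially large.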
Problem

Research questions and friction points this paper is trying to address.

Develops a foundation model for playing diverse 3D video games
Uses pixel input instead of game-specific engineering
Achieves human-like gameplay through behavior cloning training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pixel stream input for gameplay
Trains with behavior cloning and unlabeled videos
Employs decoder-only transformer for action output
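The training recipe above combines two data streams: labeled human demonstrations, and public videos whose actions are imputed by an inverse-dynamics model (IDM) before being folded into the behavior-cloning set. A minimal sketch of that pipeline, where `impute_actions`, `build_bc_dataset`, and the scalar "frames" are illustrative stand-ins:

```python
# Hedged sketch of mixing labeled demonstrations with IDM pseudo-labels.
# Function names and the toy IDM are assumptions for illustration only.

def impute_actions(frames, idm):
    """Label each consecutive frame pair with the IDM's inferred action."""
    return [(frames[i], idm(frames[i], frames[i + 1]))
            for i in range(len(frames) - 1)]

def build_bc_dataset(labeled, unlabeled_videos, idm):
    """Combine human demonstrations with IDM pseudo-labeled video frames."""
    data = list(labeled)
    for frames in unlabeled_videos:
        data.extend(impute_actions(frames, idm))
    return data

# Toy IDM: frames are scalars; infer "move_right" when the value increases.
toy_idm = lambda f0, f1: "move_right" if f1 > f0 else "move_left"

labeled = [(0, "jump")]          # one instrumented demonstration
videos = [[1, 2, 1]]             # one unlabeled clip of three frames
dataset = build_bc_dataset(labeled, videos, toy_idm)
# dataset == [(0, "jump"), (1, "move_right"), (2, "move_left")]
```

The resulting (observation, action) pairs all feed the same behavior-cloning objective, which is how the paper leverages large-scale unlabeled game video without any environment interaction.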
Authors
Yuguang Yue (Amazon)
Chris Green (Player2)
Samuel Hunt (Player2)
Irakli Salia (Player2)
Wenzhe Shi (Player2)
Jonathan J Hunt (Twitter)