🤖 AI Summary
This work addresses the challenge of building generalist agents for 3D video games with Pixels2Play-0.1 (P2P0.1), a foundation model trained end-to-end to play directly from raw pixel inputs with low-latency, human-like control across games. Methodologically, it trains a decoder-only Transformer with behavior cloning on labeled human gameplay demonstrations, supplemented by unlabeled public game videos whose actions are imputed by an inverse dynamics model; the policy generates actions autoregressively. Key contributions include: (1) qualitative evidence of competent play across simple Roblox and classic MS-DOS titles, relying solely on visual observations and trained by imitation rather than reinforcement learning; (2) ablations examining the contribution of unlabeled game videos; and (3) an outline of the scaling and evaluation steps needed to reach expert-level, text-conditioned control. The work is motivated by emerging use cases such as AI teammates and controllable non-player characters (NPCs).
📝 Abstract
We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases (AI teammates, controllable NPCs, personalized live-streamers, assistive testers), we argue that an agent must rely on the same pixel stream available to players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human gameplay are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with autoregressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. We report qualitative results showing competent play across simple Roblox and classic MS-DOS titles, present ablations on unlabeled data, and outline the scaling and evaluation steps required to reach expert-level, text-conditioned control.
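The data recipe described in the abstract (labeled demonstrations plus unlabeled videos with imputed actions) can be illustrated with a minimal sketch. Everything here is a toy assumption made for illustration, not the authors' implementation: frames are reduced to a single integer position, the action vocabulary is three strings, and `toy_inverse_dynamics`, `impute_actions`, and `build_bc_dataset` are hypothetical names. In the real system the inverse-dynamics model is a learned network operating on pixels.

```python
from typing import Callable, List, Tuple

Frame = int    # toy stand-in for an image: a 1D player position (assumption)
Action = str   # toy action vocabulary: "left", "right", "noop" (assumption)
Sample = Tuple[Frame, Action]


def toy_inverse_dynamics(prev_frame: Frame, next_frame: Frame) -> Action:
    """Hypothetical IDM: guess the action that explains a frame transition."""
    delta = next_frame - prev_frame
    if delta > 0:
        return "right"
    if delta < 0:
        return "left"
    return "noop"


def impute_actions(
    video: List[Frame], idm: Callable[[Frame, Frame], Action]
) -> List[Sample]:
    """Pair each frame with the action the IDM infers from the following frame.

    The last frame has no successor, so it receives no pseudo-label.
    """
    return [(video[t], idm(video[t], video[t + 1])) for t in range(len(video) - 1)]


def build_bc_dataset(
    labeled: List[Sample],
    unlabeled_videos: List[List[Frame]],
    idm: Callable[[Frame, Frame], Action],
) -> List[Sample]:
    """Merge human-labeled samples with IDM pseudo-labeled samples."""
    dataset = list(labeled)
    for video in unlabeled_videos:
        dataset.extend(impute_actions(video, idm))
    return dataset


# Labeled demonstration from instrumented play, plus one unlabeled video.
labeled_demo = [(0, "right"), (1, "right"), (2, "noop")]
unlabeled_video = [5, 4, 4, 6]

dataset = build_bc_dataset(labeled_demo, [unlabeled_video], toy_inverse_dynamics)
for frame, action in dataset:
    print(frame, action)
```

The merged `dataset` is what a behavior-cloning policy would then be trained on; the sketch stops before training because the abstract does not specify the loss or the action tokenization.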