🤖 AI Summary
This work addresses the challenge of building generalist gaming agents that generalize across games. The authors construct an internet-scale dataset of gameplay videos paired with player actions, covering 40,000 hours of human gameplay across more than 1,000 games, by automatically extracting actions from publicly available videos. They use it to train NitroGen, a unified vision-action foundation model, with large-scale behavior cloning. To measure cross-game generalization, they build a multi-game benchmark environment, and they release the dataset, evaluation suite, and model weights. On unseen games, the model achieves up to a 52% relative improvement in task success rate over models trained from scratch. It shows strong competence across diverse domains, including combat in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds.
📝 Abstract
We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.
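The headline figure is a relative improvement, not an absolute one, and the two are easy to confuse. The short sketch below shows the arithmetic. The function name and both success rates are hypothetical, chosen only for illustration, and are not results reported by the authors.

```python
def relative_improvement(baseline_rate: float, new_rate: float) -> float:
    """Return the gain of new_rate over baseline_rate as a fraction of the baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate


# Hypothetical task success rates, NOT taken from the paper.
from_scratch = 0.25  # assumed success rate of a model trained from scratch
transferred = 0.38   # assumed success rate of a model starting from pretrained weights

gain = relative_improvement(from_scratch, transferred)
print(f"relative improvement: {gain:.0%}")
print(f"absolute improvement: {transferred - from_scratch:.2f}")
```

With these assumed numbers, an absolute gain of 13 percentage points corresponds to a relative gain of about 52%, so the same relative figure can reflect very different absolute gains depending on the baseline.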