🤖 AI Summary
This work addresses the difficulty of simulating complex interactive environments with a neural model in real time over long trajectories. To this end, the authors introduce GameNGen, the first game engine powered entirely by a neural model. Trained on recorded DOOM gameplay, GameNGen uses a conditional diffusion model as an autoregressive engine core, predicting each frame from past frames and actions at 20 FPS on a single TPU. A conditioning augmentation mechanism and a decoder fine-tuning step keep generation stable and visually faithful over multi-minute sessions, reaching a next-frame PSNR of 29.4, comparable to lossy JPEG compression. Human raters distinguish its outputs from real gameplay only slightly better than chance. This work points toward a new paradigm of fully neural, interactive game engines.
📝 Abstract
We present GameNGen, the first game engine powered entirely by a neural model, enabling real-time interaction with a complex environment over long trajectories at high quality. Trained on the classic game DOOM, GameNGen learns from recorded gameplay to produce a playable environment that can interactively simulate new trajectories. GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next-frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of auto-regressive generation. GameNGen is trained in two phases: (1) an RL agent learns to play the game and its training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations help ensure stable auto-regressive generation over long trajectories, and decoder fine-tuning improves the fidelity of visual details and text.
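The auto-regressive loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `denoise` is a hypothetical stub standing in for the conditional diffusion model, the frame size and context length are made-up toy values, and the Gaussian noise added to context frames is a simplified stand-in for the paper's conditioning augmentation (which trains the model to tolerate its own accumulated errors).

```python
import numpy as np

H, W, C = 8, 8, 3   # toy frame dimensions (illustrative, not from the paper)
CONTEXT = 4         # number of past frames the model conditions on (assumed)

rng = np.random.default_rng(0)

def denoise(context_frames, actions):
    """Stub for the diffusion model's denoising step.

    A real model would iteratively denoise a latent conditioned on the
    past frames and actions; here we just average the context frames.
    """
    return np.mean(context_frames, axis=0)

def simulate(initial_frames, actions, noise_std=0.1):
    """Auto-regressively roll out frames, conditioning on past frames/actions."""
    frames = list(initial_frames)
    for t, _action in enumerate(actions):
        context = np.stack(frames[-CONTEXT:])
        # Conditioning (noise) augmentation: corrupt the context so the
        # model stays stable when fed its own imperfect past predictions.
        context = context + rng.normal(0.0, noise_std, context.shape)
        next_frame = denoise(context, actions[: t + 1])
        frames.append(np.clip(next_frame, 0.0, 1.0))
    return frames

start = [rng.random((H, W, C)) for _ in range(CONTEXT)]
rollout = simulate(start, actions=["forward", "turn_left", "fire"])
```

Each generated frame re-enters the context window, which is exactly why error accumulation is the central stability challenge the conditioning augmentations address.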