AI Summary
This work addresses a limitation of existing game generation methods, which typically produce code in a single pass and struggle to identify playability issues at the interaction level. To overcome this, the authors propose Play2Code, a continuous generation framework integrated with PlaytestArena, an evaluation environment that, for the first time, deeply incorporates browser-based GUI agents into the game creation pipeline. This integration establishes a closed "generate-play-feedback" loop, leveraging a shared memory mechanism, behavioral scoring rubrics, and interactive code generation to substantially enhance game playability. Experimental results demonstrate that Play2Code achieves a rubric pass rate of 66.8%, representing improvements of 37.1 and 14.6 percentage points over one-shot generation and state-of-the-art agent-based coding baselines, respectively.
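The reported gains imply the baseline pass rates, which the summary does not state directly. A minimal worked calculation, assuming the improvements are absolute percentage-point differences from the 66.8% figure (the dictionary keys are illustrative labels, not names from the paper):

```python
# Reported Play2Code rubric pass rate, in percent.
play2code = 66.8

# Reported improvements in percentage points; keys are illustrative labels.
gains = {"single_pass": 37.1, "agentic_coding": 14.6}

# Implied baseline = Play2Code rate minus the reported gain.
implied = {name: round(play2code - gain, 1) for name, gain in gains.items()}

for name, rate in implied.items():
    print(f"{name}: {rate}%")
```

Under that assumption, the one-shot baseline sits at about 29.7% and the agent-based coding baseline at about 52.2%.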
Abstract
Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that pairs 200 browser-based game generation tasks across eight genres with rubrics of expected in-play behaviors, adjudicated by a GUI agent that loads each build in a browser and plays it; and (2) as a subjective playtester, for which we propose Play2Code, where a game agent and a GUI agent operate in a sustained loop with shared memory, turning game generation into a dialogue between coding and playing. Our experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points respectively. Further analysis shows that GUI playtester feedback is more traceable than a human report, yet idiosyncratic in ways reminiscent of human testers, establishing game playtesting as a critical testbed for interactive code generation. Our project website is available at https://continual-game-generation.vercel.app/.