🤖 AI Summary
Existing evaluation benchmarks struggle to simultaneously assess software development complexity and multimodal understanding, hindering the advancement of multimodal agents. To address this gap, this work proposes GameDevBench — the first multimodal agent benchmark centered on game development — comprising 132 complex tasks derived from real-world tutorials that require agents to jointly manage large codebases and diverse multimodal assets such as shaders, sprites, and animations. By introducing game development as a novel evaluation domain, this study reveals a strong correlation between perceived task difficulty and multimodal complexity, and it incorporates lightweight image- and video-based feedback mechanisms to enhance agent comprehension. Experimental results show that even state-of-the-art models solve only 54.5% of the tasks; notably, with the proposed feedback, Claude Sonnet 4.5 improves on 2D graphics tasks, raising its success rate from 33.3% to 47.7% and validating the effectiveness of the approach.
📝 Abstract
Despite rapid progress on coding agents, their multimodal counterparts have lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed, as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires over three times as many lines of code and file changes as solutions in prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image- and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, the largest change being an increase in Claude Sonnet 4.5's success rate from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.