GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of efficient, reliable automated verification for games generated by large language models (LLMs). Conventional agent-based verification suffers from low coverage, high computational cost, and strong dependence on the agent's gameplay ability. The authors propose a paradigm based on keypoint decomposition and parallel verification: a game specification is decomposed into verifiable keypoint assertions, and each assertion is grounded in an independent verification unit that uses runtime state injection to place the game in a target state, runs a bounded interaction there, and judges the outcome. The accompanying GGV-Harness framework provides concurrency management, runtime isolation, and fault recovery, which lets the units run at scale. On the VeriGame dataset (100 games across seven genres), the approach reaches up to 92.2% accuracy against human judgments, compared with 58.8% for the coverage-enforced Agent-as-a-Verifier baseline, while reducing wall-clock verification time by up to 16.6×.
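To make the idea of a verification unit concrete, here is a minimal Python sketch. It is not the authors' implementation: `ToyGame`, `VerificationUnit`, `run_unit`, and the coin-and-level rule are all hypothetical names and rules invented for illustration. The sketch only shows the three steps the summary describes: inject a target state, run a bounded interaction, and judge the result against an assertion.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyGame:
    """Hypothetical stand-in for an LLM-generated game runtime."""
    state: Dict[str, int] = field(
        default_factory=lambda: {"hp": 3, "score": 0, "level": 1}
    )

    def step(self, action: str) -> None:
        # Assumed rule: a coin is worth 10 points, and reaching
        # 100 points advances the level by one.
        if action == "collect_coin":
            self.state["score"] += 10
            if self.state["score"] >= 100:
                self.state["level"] += 1
        elif action == "take_hit":
            self.state["hp"] -= 1


@dataclass
class VerificationUnit:
    """One keypoint grounded as state patch + actions + assertion (illustrative)."""
    keypoint: str
    injected_state: Dict[str, int]
    actions: List[str]
    assertion: Callable[[Dict[str, int]], bool]


def run_unit(unit: VerificationUnit, max_steps: int = 5) -> bool:
    game = ToyGame()                        # fresh runtime for every unit
    game.state.update(unit.injected_state)  # runtime state injection
    for action in unit.actions[:max_steps]:  # bounded interaction
        game.step(action)
    return unit.assertion(game.state)       # judge against the keypoint


unit = VerificationUnit(
    keypoint="Reaching 100 points advances the player to level 2",
    injected_state={"score": 90},  # jump straight to the state of interest
    actions=["collect_coin"],
    assertion=lambda s: s["level"] == 2,
)
result = run_unit(unit)
```

Because the score is patched to 90 before the interaction starts, a single action is enough to exercise the level-up rule. An agent playing from the initial state would first have to collect nine coins to reach the same check.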

📝 Abstract

LLM-based game generation promises to turn natural-language specifications into executable games, but progress is limited by the lack of reliable automated verification. Unlike conventional code generation, game correctness is defined over long-horizon interaction: a game may appear correct while violating core mechanics such as state updates, interaction rules, and phase transitions. Existing Agent-as-a-Verifier approaches collapse verification into open-ended gameplay, making verdicts reachability-bound, time-consuming, coverage-limited, and sensitive to the agent's gameplay ability. We present GameGen-Verifier, an automated verification paradigm for LLM-generated games that decomposes a specification into verifiable keypoints and grounds them into independent verification units. Each unit patches the game runtime into a concrete target state, executes a bounded interaction, and judges the outcome against the keypoint assertion. We implement GGV-Harness, a scalable agentic harness providing concurrency management, runtime isolation, and fault recovery. On VeriGame, our dataset of 100 games across seven genres, GameGen-Verifier achieves up to 92.2% accuracy against human judgments versus 58.8% for the coverage-enforced Agent-as-a-Verifier baseline, while reducing wall-clock time by up to 16.6x.
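The three harness responsibilities named in the abstract (concurrency management, runtime isolation, fault recovery) can be sketched in Python as follows. This is an illustrative toy under stated assumptions, not GGV-Harness itself: `run_isolated`, `run_all`, the tuple layout of a unit, and the example keypoints are hypothetical. Isolation is approximated with a deep copy of an in-memory state, and a real harness would presumably need stronger isolation, such as a separate runtime instance per unit.

```python
import copy
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
# (keypoint name, state patch to inject, assertion over the resulting state)
Unit = Tuple[str, State, Callable[[State], bool]]


def run_isolated(base_state: State, unit: Unit, max_retries: int = 2) -> Tuple[str, str]:
    name, patch, check = unit
    for _ in range(max_retries + 1):
        try:
            state = copy.deepcopy(base_state)  # isolation: private copy per attempt
            state.update(patch)                # inject the target state
            return name, "pass" if check(state) else "fail"
        except Exception:
            continue                           # fault recovery: retry the unit
    return name, "error"                       # give up without blocking other units


def run_all(base_state: State, units: List[Unit], workers: int = 4) -> Dict[str, str]:
    # Concurrency: units are independent, so they can be scheduled in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda u: run_isolated(base_state, u), units))


base_state = {"hp": 3, "score": 0}
units: List[Unit] = [
    ("score_is_nonnegative", {"score": 5}, lambda s: s["score"] >= 0),
    ("hp_capped_at_3", {"hp": 7}, lambda s: s["hp"] <= 3),
    ("mana_is_positive", {}, lambda s: s["mana"] > 0),  # raises KeyError
]
verdicts = run_all(base_state, units)
```

In this toy run, the first unit passes, the second fails its assertion, and the third is reported as an error after its retries are exhausted, while the other two verdicts are unaffected. The shared `base_state` is never mutated, since each attempt works on its own copy.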

Problem

Research questions and friction points this paper is trying to address.

LLM-generated games

automated verification

game correctness

runtime state

keypoint-based verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

keypoint-based verification

runtime state injection

LLM-generated games