🤖 AI Summary
Evaluating LLM code generation by compilation success alone is misleading for multi-component, domain-specific artifacts such as executable game scenes, because it captures neither functional correctness nor structural fidelity. This work proposes Mage, a four-axis evaluation protocol that assesses LLM-generated Unity game scenes on compilation success, runtime success, structural fidelity, and mechanism adherence. The study finds that compilation success rate is negatively correlated with functional correctness on this task, and that conditioning on an intermediate representation (IR) is crucial for structural fidelity: direct generation reaches a 43% mean runtime success rate but a mechanism F1 of only about 0.12, whereas IR-conditioned generation raises F1 to as high as 1.00 while halving the runtime success rate. The gap shows why multi-axis evaluation is needed to expose quality differences that a single pass rate hides.
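To make the mechanism F1 figures concrete, here is a minimal Python sketch of a set-based F1. The source does not say how mechanism F1 is computed, so the set-overlap definition, the function name `mechanism_f1`, and the mechanism labels are all illustrative assumptions, not the authors' metric.

```python
def mechanism_f1(expected, generated):
    """Set-based F1 between expected and generated mechanism labels.

    Assumption: mechanisms are comparable as plain string labels.
    The paper's actual matching procedure may differ.
    """
    expected, generated = set(expected), set(generated)
    if not expected and not generated:
        return 1.0  # nothing required, nothing produced
    true_positives = len(expected & generated)
    if true_positives == 0:
        return 0.0  # no overlap at all, including an empty generated set
    precision = true_positives / len(generated)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for one playable concept.
required = {"collect_item", "reach_goal", "avoid_hazard", "timer"}

# A scene can compile and run while implementing only a fraction of
# the required mechanisms; this is the divergence the summary describes.
sparse_scene = {"reach_goal"}
faithful_scene = set(required)

sparse_score = mechanism_f1(required, sparse_scene)
faithful_score = mechanism_f1(required, faithful_scene)
```

Under this definition, a scene that runs but covers one of four required mechanisms scores well below a scene that covers all four, regardless of whether either compiles.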
📝 Abstract
Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol named "Mage" (compile success, runtime success, structural fidelity, and mechanism adherence), applied to 858 generation attempts across four open-weight LLMs (7B to 30B), 26 hand-crafted Unity goal pattern playable concepts, and two automatically extracted IR granularity levels. Direct NL-to-C# generation achieves the highest runtime-pass rate (43% mean) yet produces structurally vacuous scenes (mechanism $F_1 \approx 0.12$). Structural IR conditioning halves the runtime rate but recovers domain-faithful structure ($F_1$ up to 1.00). Within IR conditioning, behavior-only and full-scene granularity are statistically indistinguishable (McNemar $p = 1.0$), indicating input-level granularity saturation. These results show that compile rate is anti-correlated with functional correctness in this domain and that multi-axis evaluation is necessary to detect the divergence. We release the benchmark, replay logs, and per-record metrics for independent verification.
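The McNemar comparison of the two IR granularity levels can be sketched as follows. This is a minimal sketch: the abstract does not say which McNemar variant was used or which binary outcome was paired, so the exact two-sided binomial form, the function names, and the outcome lists are assumptions for illustration only.

```python
from math import comb


def discordant_counts(outcomes_a, outcomes_b):
    """Count paired attempts where exactly one condition succeeded."""
    only_a = sum(1 for a, b in zip(outcomes_a, outcomes_b) if a and not b)
    only_b = sum(1 for a, b in zip(outcomes_a, outcomes_b) if b and not a)
    return only_a, only_b


def mcnemar_exact(only_a, only_b):
    """Exact two-sided McNemar p-value from the discordant pair counts.

    Under the null hypothesis, each discordant pair favors either
    condition with probability 0.5, so the smaller count follows a
    Binomial(n, 0.5) distribution.
    """
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical paired pass/fail outcomes on the same prompts.
behavior_only = [True, False, True, True, False, False]
full_scene = [False, True, True, True, False, False]

only_a, only_b = discordant_counts(behavior_only, full_scene)
p_value = mcnemar_exact(only_a, only_b)
```

With equal discordant counts the two-sided p-value is capped at 1.0, which is the pattern a result of $p = 1.0$ reflects: neither granularity level wins more of the disagreements than the other.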