🤖 AI Summary
Existing programming benchmarks predominantly evaluate algorithmic problem-solving, neglecting dimensions essential for visual game development: playability, visual aesthetics, and user interactivity.
Method: We introduce V-GameGym, the first benchmark tailored for code large language models (Code LLMs) to generate visual games. It comprises 2,219 high-quality samples organized into 100 thematic clusters. We propose a clustering-based sample selection strategy and a multimodal evaluation framework integrating playability, visual quality, and interaction plausibility, with automated assessment conducted in a UI sandbox environment supporting visual code synthesis and execution.
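The clustering-based selection described above can be illustrated with a small stdlib-only sketch: embed each candidate sample as a vector, group the vectors with k-means, and keep the sample nearest each centroid as that cluster's representative. The function names and the toy 2-D embeddings are illustrative assumptions, not the paper's actual pipeline (which builds 100 thematic clusters from real-repository samples).

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid, then
    recompute centroids as cluster means. Returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # distinct initial centroids
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:  # empty clusters keep their old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

def select_representatives(points, k):
    """Pick, per cluster, the index of the sample closest to the centroid."""
    labels, centroids = kmeans(points, k)
    reps = []
    for c in range(k):
        member_ids = [i for i in range(len(points)) if labels[i] == c]
        if member_ids:
            reps.append(min(member_ids, key=lambda i: math.dist(points[i], centroids[c])))
    return reps

# Toy usage: six 2-D "embeddings" forming three loose groups.
samples = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
representatives = select_representatives(samples, k=3)
```

Selecting centroid-nearest samples per cluster is one common way to balance diversity (clusters span distinct themes) against quality (each pick is typical of its cluster).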
Contribution/Results: Experiments demonstrate that V-GameGym significantly improves evaluation validity across interface generation and interactive logic implementation. It effectively bridges the gap between algorithmic reasoning capabilities and practical visual game development requirements, establishing a new standard for assessing Code LLMs in interactive, multimodal software synthesis.
📝 Abstract
Code large language models have demonstrated remarkable capabilities in programming tasks, yet current benchmarks primarily focus on a single modality rather than visual game development. Most existing code-related benchmarks evaluate syntax correctness and execution accuracy, overlooking game-specific metrics such as playability, visual aesthetics, and user engagement that are essential for real-world deployment. To address the gap between current LLM strengths in algorithmic problem-solving and competitive programming on the one hand, and the comprehensive requirements of practical game development on the other, we present V-GameGym, a comprehensive benchmark comprising 2,219 high-quality samples across 100 thematic clusters derived from real-world repositories, curated with a novel clustering-based methodology to ensure both diversity and structural completeness. Further, we introduce a multimodal evaluation framework with an automated LLM-driven pipeline for visual code synthesis using complete UI sandbox environments. Our extensive analysis reveals that V-GameGym effectively bridges the gap between code generation accuracy and practical game development workflows, providing quantifiable quality metrics for visual programming and interactive element generation.
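One building block of an automated sandbox pipeline like the one the abstract describes is a basic playability smoke check: execute the generated game code in an isolated subprocess and flag crashes. The sketch below is a minimal stdlib-only assumption of how such a check might look (the name `smoke_test` and the timeout policy are illustrative, not from the paper; the real framework also scores visual quality and interaction plausibility, which this does not attempt).

```python
import os
import subprocess
import sys
import tempfile

def smoke_test(game_code: str, timeout_s: float = 5.0) -> bool:
    """Run generated game code in a fresh interpreter process.

    Returns False if the process exits with a non-zero code (a crash),
    True otherwise. Hitting the timeout counts as success here, since a
    well-formed interactive game loop never exits on its own.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(game_code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout_s,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return True
    finally:
        os.unlink(path)
```

A production sandbox would additionally virtualize the display (e.g., a dummy video driver for GUI libraries), capture frames for visual scoring, and inject synthetic input events to probe interactivity.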