🤖 AI Summary
Existing programming benchmarks predominantly evaluate algorithmic problem-solving, neglecting dimensions essential for visual game development: playability, visual aesthetics, and user interactivity.
Method: We introduce V-GameGym, the first benchmark tailored for code large language models (Code LLMs) to generate visual games. It comprises 2,219 high-quality samples organized into 100 thematic clusters. We propose a clustering-based sample selection strategy and a multimodal evaluation framework integrating playability, visual quality, and interaction plausibility, with automated assessment conducted in a UI sandbox environment supporting visual code synthesis and execution.
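The clustering-based selection described above can be illustrated with a small stdlib-only sketch: embed each candidate sample as a vector, group the vectors with k-means, and keep the sample nearest each centroid as that cluster's representative. The function names and the toy 2-D embeddings are illustrative assumptions, not the paper's actual pipeline (which builds 100 thematic clusters from real-repository samples).

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid, then
    recompute centroids as cluster means. Returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # distinct initial centroids
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:  # empty clusters keep their old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

def select_representatives(points, k):
    """Pick, per cluster, the index of the sample closest to the centroid."""
    labels, centroids = kmeans(points, k)
    reps = []
    for c in range(k):
        member_ids = [i for i in range(len(points)) if labels[i] == c]
        if member_ids:
            reps.append(min(member_ids, key=lambda i: math.dist(points[i], centroids[c])))
    return reps

# Toy usage: six 2-D "embeddings" forming three loose groups.
samples = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
representatives = select_representatives(samples, k=3)
```

Selecting centroid-nearest samples per cluster is one common way to balance diversity (clusters span distinct themes) against quality (each pick is typical of its cluster).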
Contribution/Results: Experiments demonstrate that V-GameGym significantly improves evaluation validity across interface generation and interactive logic implementation. It effectively bridges the gap between algorithmic reasoning capabilities and practical visual game development requirements, establishing a new standard for assessing Code LLMs in interactive, multimodal software synthesis.
📝 Abstract
Code large language models have demonstrated remarkable capabilities in programming tasks, yet current benchmarks primarily focus on a single modality rather than visual game development. Most existing code-related benchmarks evaluate syntax correctness and execution accuracy, overlooking game-specific metrics such as playability, visual aesthetics, and user engagement that are essential for real-world deployment. To address the gap between current LLM strengths in algorithmic problem-solving and competitive programming on the one hand, and the comprehensive requirements of practical game development on the other, we present V-GameGym, a comprehensive benchmark comprising 2,219 high-quality samples across 100 thematic clusters derived from real-world repositories, curated with a novel clustering-based methodology to ensure both diversity and structural completeness. Further, we introduce a multimodal evaluation framework with an automated LLM-driven pipeline for visual code synthesis using complete UI sandbox environments. Our extensive analysis reveals that V-GameGym effectively bridges the gap between code generation accuracy and practical game development workflows, providing quantifiable quality metrics for visual programming and interactive element generation.
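One building block of an automated sandbox pipeline like the one the abstract describes is a basic playability smoke check: execute the generated game code in an isolated subprocess and flag crashes. The sketch below is a minimal stdlib-only assumption of how such a check might look (the name `smoke_test` and the timeout policy are illustrative, not from the paper; the real framework also scores visual quality and interaction plausibility, which this does not attempt).

```python
import os
import subprocess
import sys
import tempfile

def smoke_test(game_code: str, timeout_s: float = 5.0) -> bool:
    """Run generated game code in a fresh interpreter process.

    Returns False if the process exits with a non-zero code (a crash),
    True otherwise. Hitting the timeout counts as success here, since a
    well-formed interactive game loop never exits on its own.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(game_code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout_s,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return True
    finally:
        os.unlink(path)
```

A production sandbox would additionally virtualize the display (e.g., a dummy video driver for GUI libraries), capture frames for visual scoring, and inject synthetic input events to probe interactivity.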