V-GameGym: Visual Game Generation for Code Large Language Models

📅 2025-09-24
🤖 AI Summary
Problem: Existing programming benchmarks predominantly evaluate algorithmic problem-solving, neglecting critical dimensions essential for visual game development, namely playability, visual aesthetics, and user interactivity. Method: We introduce V-GameGym, the first benchmark tailored for code large language models (Code LLMs) to generate visual games. It comprises 2,219 high-quality samples organized into 100 thematic clusters. We propose a clustering-based sample selection strategy and a multimodal evaluation framework integrating playability, visual quality, and interaction plausibility, with automated assessment conducted in a UI sandbox environment supporting visual code synthesis and execution. Contribution/Results: Experiments demonstrate that V-GameGym significantly improves evaluation validity across interface generation and interactive logic implementation. It effectively bridges the gap between algorithmic reasoning capabilities and practical visual game development requirements, establishing a new standard for assessing Code LLMs in interactive, multimodal software synthesis.

📝 Abstract
Code large language models have demonstrated remarkable capabilities in programming tasks, yet current benchmarks primarily focus on a single modality rather than visual game development. Most existing code-related benchmarks evaluate syntax correctness and execution accuracy, overlooking critical game-specific metrics such as playability, visual aesthetics, and user engagement that are essential for real-world deployment. To address the gap between current LLM capabilities in algorithmic problem-solving and competitive programming and the comprehensive requirements of practical game development, we present V-GameGym, a comprehensive benchmark comprising 2,219 high-quality samples across 100 thematic clusters derived from real-world repositories. A novel clustering-based curation methodology ensures both diversity and structural completeness. Further, we introduce a multimodal evaluation framework with an automated LLM-driven pipeline for visual code synthesis using complete UI sandbox environments. Our extensive analysis reveals that V-GameGym effectively bridges the gap between code generation accuracy and practical game development workflows, providing quantifiable quality metrics for visual programming and interactive element generation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating code LLMs for visual game development beyond syntax accuracy
Bridging the gap between algorithmic coding and practical game requirements
Assessing game-specific metrics like playability and visual aesthetics
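
As a rough illustration of how game-specific metrics like those listed above could be turned into a single number, the sketch below averages three per-game judge scores. The `GameScores` class, the 0 to 100 score range, and the unweighted mean are all assumptions made for this example; the source does not state the benchmark's actual scoring formula.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class GameScores:
    """Hypothetical per-game judge scores, each assumed to lie in [0, 100]."""
    playability: float
    visual_quality: float
    interaction: float

    def overall(self) -> float:
        # Unweighted mean is an assumption; the source gives no formula.
        return mean([self.playability, self.visual_quality, self.interaction])


def benchmark_score(results: list[GameScores]) -> float:
    """Average the per-game overall scores across all evaluated games."""
    if not results:
        raise ValueError("no results to aggregate")
    return mean(r.overall() for r in results)
```

One design choice worth noting: a game that fails to run in the sandbox can be recorded as `GameScores(0.0, 0.0, 0.0)`, so that execution failures lower the aggregate instead of being silently dropped.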
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal evaluation framework with automated LLM-driven pipeline
Clustering-based curation methodology for diverse game samples
Automated visual code synthesis using UI sandbox environments
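
The clustering-based curation idea can be sketched as a round-robin pick across clusters, which keeps the selected set spread over all themes. This is a minimal sketch, not the paper's procedure: it assumes cluster labels were already computed elsewhere (for example by k-means over code embeddings) and that each sample carries a hypothetical `quality` score; the function name `select_by_cluster` is likewise invented for illustration.

```python
from collections import defaultdict


def select_by_cluster(samples, budget):
    """Pick up to `budget` sample ids, cycling over clusters best-first.

    `samples` is a list of (sample_id, cluster_id, quality) tuples.
    Cluster ids and quality scores are assumed to be precomputed.
    """
    clusters = defaultdict(list)
    for sample_id, cluster_id, quality in samples:
        clusters[cluster_id].append((quality, sample_id))

    # Within each cluster, order members from highest to lowest quality.
    for members in clusters.values():
        members.sort(reverse=True)

    selected = []
    rank = 0  # rank 0 takes each cluster's best sample, rank 1 the next, ...
    while len(selected) < budget:
        added = False
        for cluster_id in sorted(clusters):
            members = clusters[cluster_id]
            if rank < len(members) and len(selected) < budget:
                selected.append(members[rank][1])
                added = True
        if not added:
            break  # every cluster is exhausted
        rank += 1
    return selected
```

Calling `select_by_cluster(samples, budget=100)` on a pool with 100 clusters would return one top-ranked sample per cluster before any cluster contributes a second one, which is the diversity property the curation step is meant to provide.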
Wei Zhang
Shanghai AI Lab
Jack Yang
Senior Lecturer, University of New South Wales
Computational Material Science
Renshuai Tao
Beijing Jiaotong University
Lingzheng Chai
Shawn Guo
Jiajun Wu
Xiaoming Chen
AIStrong
Ganqu Cui
Shanghai AI Lab
LLM Alignment, Reinforcement Learning
Ning Ding
Shanghai AI Lab
Xander Xu
Alibaba Group
Hu Wei
Alibaba Group
Bowen Zhou
Shanghai AI Lab