From Code to Play: Benchmarking Program Search for Games Using Large Language Models

📅 2024-12-05
🏛️ arXiv.org
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This study systematically evaluates the capability of large language models (LLMs) to directly synthesize executable game code across heterogeneous tasks.

Method: We introduce an LLM-driven evolutionary hill-climbing algorithm that integrates program mutation, seed generation, and a dual-language (Python/Java) execution environment, evaluated on 29 diverse tasks, including Atari mini-games, *Baba is You*, and 12 board games from the TAG framework. Together these constitute the first cross-lingual, multi-genre benchmark for game program synthesis.

Contribution/Results: Experiments reveal that model performance depends primarily on task characteristics rather than parameter count. Individual models exhibit high instability, whereas a "multi-model trial-and-selection" strategy significantly improves success rates (+23.6% on average). Our core contributions are: (1) a reproducible evaluation framework for game code synthesis; (2) empirical validation of task-aware model selection as an effective paradigm; and (3) a practical pathway toward deploying LLMs for real-world program synthesis.
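The search loop described above is simple enough to sketch. Below is a minimal Python outline of an evolutionary hill climber in which an LLM supplies both the seed program and the mutations; the `query_llm` helper, the `Task` interface, the prompt wording, and the negative-infinity score for non-executable programs are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of an LLM-driven evolutionary hill climber.
# `query_llm`, `Task`, and the scoring convention are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable

def query_llm(model: str, prompt: str) -> str:
    """Placeholder for a call to any chat-completion API (hypothetical)."""
    raise NotImplementedError("wire up your LLM client here")

@dataclass
class Task:
    description: str              # natural-language task spec shown to the LLM
    language: str                 # "Python" or "Java"
    run: Callable[[str], float]   # executes candidate source, returns a score

def evaluate(source: str, task: Task) -> float:
    """Score a candidate; non-executable programs get -inf (assumption)."""
    try:
        return task.run(source)
    except Exception:
        return float("-inf")

def hill_climb(task: Task, model: str, generations: int = 50) -> tuple[str, float]:
    # Seed: ask the LLM for an initial program from the task description.
    best = query_llm(model, f"Write a {task.language} program for this task:\n{task.description}")
    best_score = evaluate(best, task)
    for _ in range(generations):
        # Mutation: ask the LLM to rewrite/improve the current best program.
        mutant = query_llm(model, f"Task: {task.description}\nImprove this {task.language} program:\n{best}")
        score = evaluate(mutant, task)
        # Hill climbing: keep the mutant only if it does not score worse.
        if score >= best_score:
            best, best_score = mutant, score
    return best, best_score
```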

📝 Abstract
Large language models (LLMs) have shown impressive capabilities in generating program code, opening exciting opportunities for applying program synthesis to games. In this work, we explore the potential of LLMs to directly synthesize usable code for a wide range of gaming applications, focusing on two programming languages, Python and Java. We use an evolutionary hill-climbing algorithm, where the mutations and seeds of the initial programs are controlled by LLMs. For Python, the framework covers various game-related tasks, including five miniature versions of Atari games, ten levels of Baba is You, an environment inspired by Asteroids, and a maze generation task. For Java, the framework contains 12 games from the TAG tabletop games framework. Across 29 tasks, we evaluated 12 language models for Python and 8 for Java. Our findings suggest that the performance of LLMs depends more on the task than on model size. While larger models generate more executable programs, these do not always result in higher-quality solutions but are much more expensive. No model has a clear advantage, although on any specific task, one model may be better. Trying many models on a problem and using the best results across them is more reliable than using just one.
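The abstract's closing claim, that trying several models and keeping the best result is more reliable than committing to one, amounts to a selection wrapper around the per-model search. Here is a hedged sketch that reuses the `hill_climb` and `Task` interfaces from the sketch above; the model identifiers are illustrative placeholders, not the paper's evaluated set.

```python
# "Multi-model trial-and-selection": run the same program search once per model
# and keep whichever model's best program scores highest.
def multi_model_search(task, models, search_fn=hill_climb):
    best_program, best_score = None, float("-inf")
    for model in models:
        program, score = search_fn(task, model)
        if score > best_score:
            best_program, best_score = program, score
    return best_program, best_score

# Usage (hypothetical model list):
# program, score = multi_model_search(maze_task, ["model-a", "model-b", "model-c"])
```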
Problem

Research questions and friction points this paper is trying to address.

Exploring LLMs for synthesizing game code in Python and Java
Evaluating LLM performance across diverse gaming tasks and models
Assessing cost-quality trade-offs in LLM-generated game solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct LLM synthesis of executable game code in both Python and Java
Evolutionary hill-climbing algorithm with LLM-controlled mutations
Evaluated 12 language models for Python and 8 for Java across 29 diverse gaming tasks