🤖 AI Summary
This paper addresses the lack of standardized, reproducible evaluation protocols for procedural content generation (PCG) algorithms. To this end, we introduce an open-source, multi-task, heterogeneous PCG benchmark comprising 12 diverse game content generation tasks (e.g., level and rule-set design), each with a rigorously defined content representation, controllable parameters, and quantitative metrics along three dimensions: quality, diversity, and controllability. Methodologically, we establish a standardized, multidimensional evaluation framework with three baseline generators (a random generator, an evolution strategy, and a genetic algorithm) and study multi-objective fitness function design. Experiments reveal substantial variation in task difficulty and show that the choice of objective strongly affects the quality, diversity, and controllability of the generated content. Our work advances standardization, reproducibility, and interpretability in PCG evaluation, providing a unified empirical foundation for generative AI research in game development.
📝 Abstract
This paper introduces the Procedural Content Generation Benchmark for evaluating generative algorithms on different game content creation tasks. The benchmark comes with 12 game-related problems, each with multiple variants. Problems range from creating levels of different kinds to creating rule sets for simple arcade games. Each problem has its own content representation, control parameters, and evaluation metrics for quality, diversity, and controllability. This benchmark is intended as a first step towards a standardized way of comparing generative algorithms. We use the benchmark to score three baseline algorithms: a random generator, an evolution strategy, and a genetic algorithm. Results show that some problems are easier to solve than others and illustrate the impact the chosen objective has on the quality, diversity, and controllability of the generated artifacts.
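To make the three reported dimensions concrete, here is a minimal, hypothetical sketch of how a benchmark might score a batch of generated levels on quality, diversity, and controllability. None of these functions are the benchmark's actual API: the tile representation, the traversability-based quality proxy, and the density control parameter are all illustrative assumptions.

```python
import random

# Toy representation (assumption, not the benchmark's): a "level" is a
# flat list of tile IDs, where tile 0 is traversable and tile 1 is solid.

def quality(level):
    # Toy quality: fraction of traversable tiles, standing in for a
    # problem-specific playability or solvability check.
    return level.count(0) / len(level)

def diversity(levels):
    # Toy diversity: mean pairwise normalized Hamming distance in [0, 1].
    n = len(levels)
    total = sum(
        sum(a != b for a, b in zip(levels[i], levels[j])) / len(levels[i])
        for i in range(n) for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)

def controllability(levels, targets):
    # Toy controllability: how closely each level hits its requested
    # traversable-tile density (the control parameter), averaged.
    errors = [abs(quality(lv) - t) for lv, t in zip(levels, targets)]
    return 1 - sum(errors) / len(errors)

# A random baseline generator conditioned on the density target.
rng = random.Random(0)
targets = [0.2, 0.5, 0.8]
levels = [[0 if rng.random() < t else 1 for _ in range(100)] for t in targets]

print(quality(levels[0]), diversity(levels), controllability(levels, targets))
```

Even this toy version shows the tension the paper measures: an objective that rewards only quality can be satisfied by near-identical levels, which the diversity score then penalizes, while the controllability score checks adherence to control parameters independently of either.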