🤖 AI Summary
Existing benchmarks inadequately assess the general reasoning capabilities of large language models (LLMs), particularly in open-ended, dynamic decision-making settings. Method: We introduce gg-bench, an automated benchmark built as a data-generating process, so new, previously unseen game environments can be produced at will rather than drawn from a fixed set. The pipeline uses a state-of-the-art LLM to design novel game rules in natural language and to implement each game as a Gym-compatible environment; reinforcement learning agents are then trained on each game via self-play; finally, language models are evaluated through in-context learning by prompting them with the game description, current state, and valid moves, and measuring their win rate against the trained RL agents. Contribution/Results: The key idea is to unify the roles of game designer, environment implementer, and evaluation opponent within one automated loop, allowing the benchmark to grow continually and resist contamination. Experiments show that state-of-the-art general-purpose models such as GPT-4o and Claude 3.7 Sonnet achieve only 7-9% win rates on gg-bench, while dedicated reasoning models (o1, o3-mini, DeepSeek-R1) reach 31-36%. All games, the data-generation pipeline, and evaluation code are publicly released to support future work and benchmark expansion.
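The three-stage pipeline described above can be sketched as follows. This is an illustrative outline only: `call_llm` and `train_self_play` are hypothetical stubs standing in for a frontier-model API call and an RL training run, not the released gg-bench code.

```python
def call_llm(prompt):
    # Hypothetical placeholder for a frontier-model API call (e.g. to
    # GPT-4o). Stubbed with canned responses so the sketch runs end to end.
    if "Invent" in prompt:
        return "Rules: players alternate removing 1-3 tokens; last take wins."
    return "class GeneratedEnv: ...  # Gym-compatible implementation"

def train_self_play(env_source):
    # Hypothetical placeholder for self-play reinforcement learning on the
    # generated environment; returns an opaque 'policy' object here.
    return {"policy": "self-play RL agent", "env": env_source}

def generate_instance():
    # Stage 1: the LLM designs a novel game as natural-language rules.
    rules = call_llm("Invent a novel two-player turn-based game.")
    # Stage 2: the LLM implements those rules as a Gym environment.
    env_source = call_llm(f"Implement these rules as a Gym env:\n{rules}")
    # Stage 3: an RL agent is trained on that environment via self-play.
    agent = train_self_play(env_source)
    return rules, env_source, agent

rules, env_source, agent = generate_instance()
print(rules)
```

Because each call to `generate_instance` can yield a different game, the benchmark is a data-generating process rather than a fixed test set.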
📝 Abstract
We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data-generating process where new evaluation instances can be generated at will. In particular, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their win rate against these RL agents: models are prompted with the game description, current board state, and a list of valid moves, after which they output the moves they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve win rates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini, and DeepSeek-R1 achieve average win rates of 31-36%. We release the generated games, data generation process, and evaluation code in order to support future modeling work and expansion of our benchmark.
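The evaluation protocol (prompt with description, state, and valid moves; play against a trained agent; score by win rate) can be made concrete with a minimal sketch. The environment below is a simple known game (Nim) standing in for a generated gg-bench game, the opponent is a random policy standing in for the self-play RL agent, and `llm_policy` stands in for an actual LLM call; all three are illustrative assumptions, not the released code.

```python
import random

class NimEnv:
    """Minimal Gym-style environment: take 1-3 stones; taking the last wins.
    A stand-in for a generated gg-bench game, not actual gg-bench code."""
    def __init__(self, stones=10):
        self.initial = stones
        self.stones = stones

    def reset(self):
        self.stones = self.initial
        return self.stones

    def valid_moves(self):
        return [n for n in (1, 2, 3) if n <= self.stones]

    def step(self, move):
        # Returns (state, reward, done); the player who takes the last
        # stone wins the episode.
        self.stones -= move
        done = self.stones == 0
        return self.stones, (1 if done else 0), done

def opponent_policy(env):
    # Stand-in for the self-play RL agent: plays a random valid move.
    return random.choice(env.valid_moves())

def llm_policy(description, state, moves):
    # Stand-in for prompting an LLM with the game description, current
    # state, and valid-move list; here it plays Nim optimally by leaving
    # a multiple of 4 stones whenever possible.
    winning = [m for m in moves if (state - m) % 4 == 0]
    return winning[0] if winning else random.choice(moves)

def play_episode(env, description="Take 1-3 stones; taking the last wins."):
    state = env.reset()
    llm_turn = True
    while True:
        moves = env.valid_moves()
        move = (llm_policy(description, state, moves) if llm_turn
                else opponent_policy(env))
        state, reward, done = env.step(move)
        if done:
            return llm_turn  # True if the LLM made the winning move
        llm_turn = not llm_turn

random.seed(0)
winrate = sum(play_episode(NimEnv()) for _ in range(200)) / 200
print(f"LLM winrate vs random agent: {winrate:.2f}")
```

Since the stand-in policy plays 10-stone Nim perfectly from the first move, it wins every episode here; in gg-bench the games are novel and unseen, which is why even strong LLMs win far less often against the trained agents.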