The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

📅 2026-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key limitations in current evaluations of large language models' reasoning capabilities: the high cost of human-authored benchmarks and their susceptibility to training-data contamination. Inspired by 16th-century mathematical duels, the authors propose a self-play evaluation framework that requires no human intervention: models alternately generate and solve programming puzzles expressed as Python Boolean functions, and their relative performance is quantified via Elo ratings derived from adversarial matches. The approach integrates both puzzle generation and puzzle solving into a unified reasoning assessment, enabling the benchmark to expand dynamically and resist saturation. Experiments on ten state-of-the-art models show strong agreement between the resulting rankings and established human-curated benchmarks such as Humanity's Last Exam, while also exposing a significant deficiency in current models' ability to produce high-quality puzzles.

📝 Abstract
Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks that use PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or whether similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles (given a Python function that returns a Boolean, find inputs that make it return True) to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other. We evaluate 10 frontier models on TTG and closely match the rankings from existing benchmarks such as Humanity's Last Exam, without involving any human effort in creating puzzles. We also find that creating good puzzles remains a highly challenging task for current models, a skill not measured by previous benchmarks. Overall, our work suggests new paradigms for evaluating reasoning that cannot be saturated by design, and that allow testing models for other skills, like creativity and task creation, alongside problem solving.
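The Programming Puzzles format described in the abstract is mechanical to check: a puzzle is a Python function returning a Boolean, and any input that makes it return True counts as a solution. A minimal sketch of that verification loop is below; the specific puzzle and the `verify` helper are illustrative examples, not code from the paper.

```python
# Sketch of the Programming Puzzles format: a puzzle is a Python function
# returning a bool; a solution is any input making it return True.
# This puzzle and the verify() helper are illustrative, not from the paper.

def puzzle(x: list) -> bool:
    """Find three distinct integers that sum to 10."""
    return len(x) == 3 and len(set(x)) == 3 and sum(x) == 10

def verify(puzzle_fn, candidate) -> bool:
    """A candidate solves the puzzle iff the function returns True on it."""
    try:
        return puzzle_fn(candidate) is True
    except Exception:
        return False  # inputs that crash the puzzle count as failures

print(verify(puzzle, [1, 2, 7]))   # True: distinct, sums to 10
print(verify(puzzle, [5, 5, 0]))   # False: values not distinct
```

Because checking a candidate only requires running the puzzle function, solutions can be graded automatically, which is what lets the duels proceed without human judges.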
Problem

Research questions and friction points this paper is trying to address.

language model evaluation
reasoning capability
puzzle generation
benchmarking
model comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Games
model self-evaluation
programming puzzles
Elo rating
reasoning benchmark
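The Elo ratings mentioned above are computed from pairwise duel outcomes. A minimal sketch of the standard Elo update is below; the K-factor of 32 and the 400-point scale are the conventional chess values, assumed here rather than taken from the paper.

```python
# Standard Elo update from pairwise duel results. K=32 and the 400-point
# scale are the conventional chess defaults (assumed, not from the paper).

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings; score_a is 1 for an A win, 0.5 draw, 0 loss."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equal-rated models; A wins the duel:
new_a, new_b = elo_update(1000.0, 1000.0, 1.0)
print(new_a, new_b)  # 1016.0 984.0
```

Iterating this update over many generate-and-solve duels yields the relative rankings the paper compares against human-curated benchmarks.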