🤖 AI Summary
Traditional static benchmarks and human- or model-based evaluations suffer from overfitting, high cost, and inherent bias. To address these limitations, this work introduces ZeroSumEval, a dynamic, competition-based evaluation protocol built on zero-sum games. It assesses large language models (LLMs) on strategic reasoning, knowledge application, and creativity across seven adversarial games, including PyJail, Chess, and MathQuiz. The framework uses inter-model competition as the driver of dynamic assessment, providing a standardized, extensible gamified architecture with sandboxed execution, rule-based game engines, and automated scoring that enabled more than 7,000 head-to-head simulations. Evaluation of 13 mainstream LLMs shows that state-of-the-art models are strong at answering questions but weak at crafting novel, challenging problems; they also cannot reliably jailbreak one another and show limited creativity.
📝 Abstract
Evaluating the capabilities of Large Language Models (LLMs) has traditionally relied on static benchmark datasets, human assessments, or model-based evaluations, methods that often suffer from overfitting, high costs, and biases. ZeroSumEval is a novel competition-based evaluation protocol that leverages zero-sum games to assess LLMs with dynamic benchmarks that resist saturation. ZeroSumEval encompasses a diverse suite of games, including security challenges (PyJail), classic games (Chess, Liar's Dice, Poker), knowledge tests (MathQuiz), and persuasion challenges (Gandalf, Debate). These games are designed to evaluate a range of AI capabilities such as strategic reasoning, planning, knowledge application, and creativity. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework. To demonstrate this, we conduct extensive experiments with over 7,000 simulations across 7 games and 13 models. Our results show that while frontier models from the GPT and Claude families can play common games and answer questions, they struggle to play games that require creating novel and challenging questions. We also observe that models cannot reliably jailbreak each other and fail generally at tasks requiring creativity. We release our code at https://github.com/facebookresearch/ZeroSumEval.
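The core idea of a zero-sum, head-to-head evaluation can be sketched in a few lines. The sketch below is illustrative only and does not reflect the real ZeroSumEval API: the player functions, the `play_match` runner, and the rock-paper-scissors game are all stand-ins (the actual framework pits LLMs against each other in games like Chess and PyJail with sandboxed, rule-based engines).

```python
import random
from typing import Callable

# Hypothetical sketch of a zero-sum match runner. In ZeroSumEval the
# "players" would be LLM calls; here they are simple stub strategies.
Move = str
Player = Callable[[list], Move]

# Each move beats exactly one other move (rock-paper-scissors rules).
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def play_match(p1: Player, p2: Player, rounds: int = 100) -> tuple[int, int]:
    """Run a head-to-head match and return (p1_wins, p2_wins).

    Zero-sum scoring: every decided round credits exactly one player;
    ties credit neither.
    """
    h1, h2 = [], []  # move histories, exposed to the opposing player
    w1 = w2 = 0
    for _ in range(rounds):
        m1, m2 = p1(h2), p2(h1)  # each player sees the opponent's history
        h1.append(m1)
        h2.append(m2)
        if BEATS[m1] == m2:
            w1 += 1
        elif BEATS[m2] == m1:
            w2 += 1
    return w1, w2


def uniform_player(opponent_history: list) -> Move:
    """Baseline: plays uniformly at random, ignoring the opponent."""
    return random.choice(list(BEATS))


def counter_player(opponent_history: list) -> Move:
    """Adaptive: counters the opponent's most frequent move so far."""
    if not opponent_history:
        return "rock"
    favourite = max(BEATS, key=opponent_history.count)
    # Return the move whose victim is the opponent's favourite move.
    return next(m for m, victim in BEATS.items() if victim == favourite)
```

An adaptive strategy dominates a static one under this scoring, which mirrors the paper's motivation: a dynamic, adversarial benchmark keeps rewarding genuine strategic adaptation rather than memorized answers, so it resists saturation.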