🤖 AI Summary
Traditional static benchmarks and human- or model-based evaluations suffer from overfitting, high cost, and inherent bias. To address these limitations, this work introduces ZeroSumEval, a dynamic, competition-based evaluation protocol built on zero-sum games. It assesses large language models (LLMs) on strategic reasoning, knowledge application, and creativity across seven adversarial games, including PyJail, Chess, and MathQuiz. The framework uses inter-model competition as the driver of dynamic assessment, providing a standardized, extensible gamified architecture with sandboxed execution, rule-based game engines, and automated scoring that enabled more than 7,000 head-to-head simulations. Evaluation of 13 mainstream LLMs shows that state-of-the-art models are strong at answering questions but weak at crafting novel, challenging problems; they also cannot reliably jailbreak one another and show limited creativity.
📝 Abstract
Evaluating the capabilities of Large Language Models (LLMs) has traditionally relied on static benchmark datasets, human assessments, or model-based evaluations, methods that often suffer from overfitting, high costs, and biases. ZeroSumEval is a novel competition-based evaluation protocol that leverages zero-sum games to assess LLMs with dynamic benchmarks that resist saturation. ZeroSumEval encompasses a diverse suite of games, including security challenges (PyJail), classic games (Chess, Liar's Dice, Poker), knowledge tests (MathQuiz), and persuasion challenges (Gandalf, Debate). These games are designed to evaluate a range of AI capabilities such as strategic reasoning, planning, knowledge application, and creativity. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework. To demonstrate this, we conduct extensive experiments with over 7,000 simulations across 7 games and 13 models. Our results show that while frontier models from the GPT and Claude families can play common games and answer questions, they struggle to play games that require creating novel and challenging questions. We also observe that models cannot reliably jailbreak each other and fail generally at tasks requiring creativity. We release our code at https://github.com/facebookresearch/ZeroSumEval.
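The core idea of a zero-sum, head-to-head evaluation can be sketched in a few lines. The sketch below is illustrative only and does not reflect the real ZeroSumEval API: the player functions, the `play_match` runner, and the rock-paper-scissors game are all stand-ins (the actual framework pits LLMs against each other in games like Chess and PyJail with sandboxed, rule-based engines).

```python
import random
from typing import Callable

# Hypothetical sketch of a zero-sum match runner. In ZeroSumEval the
# "players" would be LLM calls; here they are simple stub strategies.
Move = str
Player = Callable[[list], Move]

# Each move beats exactly one other move (rock-paper-scissors rules).
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def play_match(p1: Player, p2: Player, rounds: int = 100) -> tuple[int, int]:
    """Run a head-to-head match and return (p1_wins, p2_wins).

    Zero-sum scoring: every decided round credits exactly one player;
    ties credit neither.
    """
    h1, h2 = [], []  # move histories, exposed to the opposing player
    w1 = w2 = 0
    for _ in range(rounds):
        m1, m2 = p1(h2), p2(h1)  # each player sees the opponent's history
        h1.append(m1)
        h2.append(m2)
        if BEATS[m1] == m2:
            w1 += 1
        elif BEATS[m2] == m1:
            w2 += 1
    return w1, w2


def uniform_player(opponent_history: list) -> Move:
    """Baseline: plays uniformly at random, ignoring the opponent."""
    return random.choice(list(BEATS))


def counter_player(opponent_history: list) -> Move:
    """Adaptive: counters the opponent's most frequent move so far."""
    if not opponent_history:
        return "rock"
    favourite = max(BEATS, key=opponent_history.count)
    # Return the move whose victim is the opponent's favourite move.
    return next(m for m, victim in BEATS.items() if victim == favourite)
```

An adaptive strategy dominates a static one under this scoring, which mirrors the paper's motivation: a dynamic, adversarial benchmark keeps rewarding genuine strategic adaptation rather than memorized answers, so it resists saturation.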