ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of standardized, scalable, and multidimensional evaluation frameworks for large language models (LLMs), this paper proposes a dynamic zero-sum game evaluation paradigm. Methodologically, the authors develop a scalable platform supporting heterogeneous competitive games, including Capture the Flag (CTF), chess, and MathQuiz, and introduce a unified strategy-modeling layer based on DSPy, enabling fair cross-task and cross-model head-to-head benchmarking with automated scoring. They further incorporate fine-grained capability attribution across five dimensions: strategic reasoning, planning, knowledge application, safety, and adaptability. The contributions are: (1) a modular, game-based evaluation framework that supports arbitrary new games via plug-and-play integration; and (2) improved assessment reliability, reproducibility, and model discriminability, empirically validated across leading closed- and open-source LLMs.

📝 Abstract
We introduce ZeroSumEval, a dynamic, competition-based, and evolving evaluation framework for Large Language Models (LLMs) that leverages competitive games. ZeroSumEval encompasses a diverse suite of games, including security challenges (Capture the Flag), classic board games (chess), and knowledge tests (MathQuiz). These games are designed to evaluate a range of capabilities such as strategic reasoning, planning, knowledge application, safety, and adaptability. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework for easily implementing games and by leveraging DSPy to provide a better abstraction for LLM player strategies.
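The abstract mentions that DSPy is used to abstract LLM player strategies from the games themselves. As a rough illustration of that decoupling, the sketch below defines a minimal player-strategy interface in plain Python. All names here (`Player`, `make_move`, `EchoPlayer`) are hypothetical, chosen for illustration; they are not the actual ZeroSumEval or DSPy API.

```python
from abc import ABC, abstractmethod

class Player(ABC):
    """Hypothetical player-strategy interface: a game only needs to know
    that a player maps a game state to a move, not how the move is produced
    (DSPy module, raw LLM call, or a scripted baseline)."""

    @abstractmethod
    def make_move(self, game_state: str) -> str:
        """Return the player's next action given the current game state."""

class EchoPlayer(Player):
    """Trivial stand-in for an LLM-backed strategy: always plays a
    fixed opening move, useful as a baseline or for testing a game loop."""

    def make_move(self, game_state: str) -> str:
        return "e2e4"  # placeholder chess move

p = EchoPlayer()
print(p.make_move("start"))  # -> e2e4
```

Because every strategy exposes the same `make_move` contract, the same game loop can pit any two implementations against each other head-to-head.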
Problem

Research questions and friction points this paper is trying to address.

How can LLMs be evaluated through inter-model competition rather than static benchmarks?
How can capabilities such as reasoning, planning, and safety be assessed through games?
How can diverse evaluation games be implemented on a single standardized, extensible platform?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Competition-based evaluation framework for LLMs
Diverse suite of games for capability assessment
Standardized, extensible framework with DSPy integration
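The innovation list highlights a standardized, extensible framework into which new games can be plugged. One common way to realize such plug-and-play extensibility is a decorator-based registry; the sketch below is a minimal, hypothetical version of that pattern (the names `register_game`, `GAME_REGISTRY`, and the toy `MathQuiz` class are assumptions for illustration, not the paper's actual implementation).

```python
# Hypothetical plug-and-play registry: a new game registers itself under a
# name, and the evaluation harness can look games up without hard-coding them.
GAME_REGISTRY: dict[str, type] = {}

def register_game(name: str):
    """Class decorator that adds a game class to the global registry."""
    def wrap(cls: type) -> type:
        GAME_REGISTRY[name] = cls
        return cls
    return wrap

@register_game("mathquiz")
class MathQuiz:
    """Toy knowledge-test game: one player poses a question with a known
    answer, the other answers; zero-sum scoring on correctness."""

    def play_round(self, question: str, answer: str, guess: str) -> bool:
        # The answering player wins the round iff the guess matches.
        return guess.strip() == answer.strip()

# The harness instantiates games purely by name:
game = GAME_REGISTRY["mathquiz"]()
print(game.play_round("What is 2 + 2?", "4", "4"))  # -> True
```

Under this design, adding a new game is a single decorated class definition; nothing in the harness needs to change, which matches the "arbitrary new games via plug-and-play integration" claim in the summary above.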