🤖 AI Summary
To address the lack of standardized, scalable, and multidimensional evaluation frameworks for large language models (LLMs), this paper proposes a dynamic zero-sum game evaluation paradigm. Methodologically, the authors develop a scalable platform supporting heterogeneous competitive games, including Capture the Flag (CTF), chess, and MathQuiz, and introduce a unified strategy-modeling layer based on DSPy, enabling fair cross-task and cross-model head-to-head benchmarking with automated scoring. The framework further incorporates fine-grained capability attribution across five dimensions: strategic reasoning, planning, knowledge application, safety, and adaptability. The contributions are: (1) the first modular, gamified evaluation framework that supports arbitrary new games via plug-and-play integration; and (2) improved assessment reliability, reproducibility, and model discriminability, validated empirically on leading closed- and open-source LLMs.
📝 Abstract
We introduce ZeroSumEval, a dynamic, competition-based, and evolving evaluation framework for Large Language Models (LLMs) that leverages competitive games. ZeroSumEval encompasses a diverse suite of games, including security challenges (Capture the Flag), classic board games (chess), and knowledge tests (MathQuiz). These games are designed to evaluate a range of capabilities such as strategic reasoning, planning, knowledge application, safety, and adaptability. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework for easily implementing games, and by leveraging DSPy to provide a better abstraction for LLM player strategies.
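The "plug-and-play" game integration described above can be illustrated with a minimal sketch. Note this is a hypothetical illustration, not the actual ZeroSumEval API: the registry, the `GameState` interface, and the toy `GuessParity` game are all invented names used only to show how a new zero-sum game could be added by subclassing a small interface and registering it.

```python
# Hypothetical sketch of plug-and-play game registration (names are
# illustrative, not the real ZeroSumEval API).
from abc import ABC, abstractmethod

GAME_REGISTRY = {}  # name -> game class (hypothetical registry)

def register_game(name):
    """Decorator that makes a game discoverable by name."""
    def deco(cls):
        GAME_REGISTRY[name] = cls
        return cls
    return deco

class GameState(ABC):
    """Minimal interface a new game would implement (illustrative)."""
    @abstractmethod
    def is_over(self) -> bool: ...
    @abstractmethod
    def play(self, move: str) -> None: ...
    @abstractmethod
    def scores(self) -> dict: ...

@register_game("guess_parity")
class GuessParity(GameState):
    """Toy zero-sum game: player 1 picks a number, player 2 guesses its parity."""
    def __init__(self):
        self.number = None
        self.guess = None
    def is_over(self):
        return self.guess is not None
    def play(self, move):
        if self.number is None:
            self.number = int(move)
        else:
            self.guess = move
    def scores(self):
        correct = ("even" if self.number % 2 == 0 else "odd") == self.guess
        return {"player2": 1.0 if correct else 0.0,
                "player1": 0.0 if correct else 1.0}

# A harness can now run any registered game generically.
game = GAME_REGISTRY["guess_parity"]()
game.play("7")        # player 1 picks 7
game.play("odd")      # player 2 guesses parity
print(game.scores())  # -> {'player2': 1.0, 'player1': 0.0}
```

In the real framework, moves would come from DSPy-wrapped LLM players rather than hard-coded strings; the point here is only that a uniform game interface plus a registry is what lets new games slot in without touching the evaluation harness.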