🤖 AI Summary
To address the lack of standardized, scalable, and multidimensional evaluation frameworks for large language models (LLMs), this paper proposes a dynamic zero-sum game evaluation paradigm. Methodologically, the authors develop a scalable platform supporting heterogeneous competitive games, including Capture the Flag (CTF), chess, and MathQuiz, and introduce a unified strategy-modeling layer based on DSPy, enabling fair cross-task and cross-model head-to-head benchmarking with automated scoring. The framework further incorporates fine-grained capability attribution across five dimensions: strategic reasoning, planning, knowledge application, safety, and adaptability. The contributions are: (1) the first modular, gamified evaluation framework that supports arbitrary new games via plug-and-play integration; and (2) improved assessment reliability, reproducibility, and model discriminability, validated empirically on leading closed- and open-source LLMs.
📝 Abstract
We introduce ZeroSumEval, a dynamic, competition-based, and evolving evaluation framework for Large Language Models (LLMs) that leverages competitive games. ZeroSumEval encompasses a diverse suite of games, including security challenges (Capture the Flag), classic board games (chess), and knowledge tests (MathQuiz). These games are designed to evaluate a range of capabilities such as strategic reasoning, planning, knowledge application, safety, and adaptability. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework for easily implementing games, and by leveraging DSPy to provide a better abstraction for LLM player strategies.
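The "plug-and-play" game integration described above can be illustrated with a minimal sketch. Note this is a hypothetical illustration, not the actual ZeroSumEval API: the registry, the `GameState` interface, and the toy `GuessParity` game are all invented names used only to show how a new zero-sum game could be added by subclassing a small interface and registering it.

```python
# Hypothetical sketch of plug-and-play game registration (names are
# illustrative, not the real ZeroSumEval API).
from abc import ABC, abstractmethod

GAME_REGISTRY = {}  # name -> game class (hypothetical registry)

def register_game(name):
    """Decorator that makes a game discoverable by name."""
    def deco(cls):
        GAME_REGISTRY[name] = cls
        return cls
    return deco

class GameState(ABC):
    """Minimal interface a new game would implement (illustrative)."""
    @abstractmethod
    def is_over(self) -> bool: ...
    @abstractmethod
    def play(self, move: str) -> None: ...
    @abstractmethod
    def scores(self) -> dict: ...

@register_game("guess_parity")
class GuessParity(GameState):
    """Toy zero-sum game: player 1 picks a number, player 2 guesses its parity."""
    def __init__(self):
        self.number = None
        self.guess = None
    def is_over(self):
        return self.guess is not None
    def play(self, move):
        if self.number is None:
            self.number = int(move)
        else:
            self.guess = move
    def scores(self):
        correct = ("even" if self.number % 2 == 0 else "odd") == self.guess
        return {"player2": 1.0 if correct else 0.0,
                "player1": 0.0 if correct else 1.0}

# A harness can now run any registered game generically.
game = GAME_REGISTRY["guess_parity"]()
game.play("7")        # player 1 picks 7
game.play("odd")      # player 2 guesses parity
print(game.scores())  # -> {'player2': 1.0, 'player1': 0.0}
```

In the real framework, moves would come from DSPy-wrapped LLM players rather than hard-coded strings; the point here is only that a uniform game interface plus a registry is what lets new games slot in without touching the evaluation harness.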