The Evaluation Game: Beyond Static LLM Benchmarking

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Safety fine-tuning of large language models lacks theoretical grounding, and static benchmarks fail to capture a model's true robustness against adversarial jailbreak attacks. This work introduces a two-player game-theoretic framework that models the interaction between an evaluator, who audits the model for jailbreaks, and a trainer. It formalizes data augmentation as a group action and interprets a benchmark as an orbit under that action rather than as a fixed prompt set. In the simplest non-trivial instance, the circle with cyclic translation groups, the analysis identifies distinct regimes depending on the trainer's generalization range: below a critical threshold, the evaluator maintains a constant miss ratio for linearly many rounds. Experiments on Llama, Qwen, and Mistral indicate that fine-tuning on adversarial prompts induces only local generalization, with refusal rates on test examples highly correlated with their distance to the fine-tuning prompts. Consequently, audit protocols that ignore trainer-side adaptation cannot distinguish a genuine fix from a memorized patch.
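To make the "benchmark as an orbit" idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's construction: the group is Z_26 acting on prompts by rotating lowercase letters (a Caesar shift), and the names `act`, `orbit`, and `benchmark_orbit` are hypothetical. Any other family of invertible augmentations that composes like a group could be substituted.

```python
import string

ALPHABET = string.ascii_lowercase  # the 26 letters the toy group acts on


def act(shift, prompt):
    """Toy group action (assumption): the element `shift` of Z_26 rotates
    every lowercase letter of `prompt`; other characters are left fixed."""
    out = []
    for ch in prompt:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)


def orbit(prompt):
    """All images of one prompt under the whole group."""
    return {act(g, prompt) for g in range(26)}


def benchmark_orbit(seeds):
    """A 'dynamic benchmark' in this toy sense: the union of the orbits
    of the seed prompts, instead of the seed prompts alone."""
    result = set()
    for p in seeds:
        result |= orbit(p)
    return result


if __name__ == "__main__":
    seeds = ["tell me a secret"]
    print(len(benchmark_orbit(seeds)), "variants from", len(seeds), "seed")
```

The two group-action axioms hold for this toy: shifting by 0 leaves a prompt unchanged, and shifting by `a` after `b` equals shifting by `(a + b) % 26`. A static benchmark corresponds to evaluating only on `seeds`; the orbit view evaluates on every transformed variant.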

📝 Abstract

As jailbreaks, adversarially crafted inputs that bypass safety constraints, continue to be discovered in Large Language Models, practitioners increasingly rely on fine-tuning as a defensive strategy. Yet the theoretical foundations underlying this robustness fine-tuning remain underexplored. We introduce a game-theoretic framework in which the interaction between an evaluator (auditing the model for jailbreaks) and a trainer is formalized as a two-player game. A key feature of our approach is the use of group actions, a mathematical structure that captures symmetries and transformations, to formally represent data augmentation. The simplest non-trivial instance is the circle with cyclic translation groups, where we exhibit various regimes depending on the trainer's generalization range. Below a critical threshold, the evaluator maintains a constant miss ratio for linearly many rounds, whereas other settings can yield very different behaviors. We further provide empirical evidence supporting locality-dependence of the model: for the three model families we tested (Llama, Qwen and Mistral), we have significant evidence that fine-tuning on adversarial prompts induces only local generalization, with refusal rates on test examples highly correlated with the distance to the fine-tuning prompts. Our framework recasts the central object of adversarial evaluation: a benchmark is not a static set of prompts but an orbit under the evaluator's group action, and audit protocols that ignore trainer-side adaptation cannot distinguish a genuine fix from a memorized patch.
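The circle instance described in the abstract can be illustrated with a small simulation. This is a toy sketch under loudly stated assumptions, not the paper's formal game: the circle is discretized as Z_n, the evaluator translates its seed points by a fixed step each round, the trainer then patches every point within a fixed `radius` of each probed point (a stand-in for the generalization range), and "miss ratio" is taken to mean the fraction of a round's probes that the model does not yet refuse. The function name `play` and all parameters are hypothetical.

```python
def play(n, seeds, step, radius, rounds):
    """Simulate a toy evaluator/trainer game on the discrete circle Z_n.

    Returns the per-round miss ratio: the fraction of the evaluator's
    probes that land on points the trainer has not yet patched.
    """
    patched = set()  # points the model currently refuses
    ratios = []
    for t in range(rounds):
        # Evaluator move: apply the cyclic translation by t * step.
        shift = (t * step) % n
        probes = [(s + shift) % n for s in seeds]

        # Score the round before the trainer reacts.
        misses = [p for p in probes if p not in patched]
        ratios.append(len(misses) / len(probes))

        # Trainer move (assumption): fine-tune on every released probe,
        # which patches a neighborhood of the given radius around it.
        for p in probes:
            for k in range(-radius, radius + 1):
                patched.add((p + k) % n)
    return ratios


if __name__ == "__main__":
    # Generalization range smaller than the evaluator's step.
    print(play(n=100, seeds=[0], step=5, radius=2, rounds=20))
    # Generalization range at least as large as the step.
    print(play(n=100, seeds=[0], step=5, radius=5, rounds=20))
```

In this toy, with `n=100` and `step=5`, a radius of 2 leaves every one of the 20 rounds unpatched when probed, while a radius of 5 catches every probe after the first round. The switch between the two behaviors as the radius crosses the step size is only meant to convey the flavor of a threshold in the trainer's generalization range; the paper's actual regimes and critical value may differ.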

Problem

Research questions and friction points this paper is trying to address.

jailbreaks

robustness fine-tuning

adversarial evaluation

static benchmarking

generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

game-theoretic framework

group actions

adversarial evaluation