LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

📅 2025-10-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM evaluation methods struggle to capture subjective, open-ended, and fine-grained behavioral characteristics. To address this, we propose the first automated pairwise evaluation framework integrating game-theoretic aggregation, self-play among LLMs, and human validation: LLMs conduct pairwise preference judgments to construct a preference graph; game-theoretic voting rules (e.g., Copeland or Borda) aggregate these judgments into global rankings; and consistency between model-derived rankings and human votes is systematically measured. Our key contribution lies in unifying three components—mutual evaluation, mechanism-based aggregation, and interpretable validation grounded in real human preferences—thereby establishing a human-aligned evaluation paradigm. Experiments show strong rank correlation between LLM self-evaluations and human preferences (Spearman’s ρ > 0.7), yet interpretable discrepancies persist in semantically subtle or value-laden tasks, revealing both the promise and fundamental limitations of automated assessment—and pointing to concrete avenues for refinement.
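To make the aggregation step concrete, the sketch below computes Copeland and Borda rankings from a pairwise win-count matrix in which entry (i, j) counts how often model i's output was preferred over model j's by the peer judges. The matrix values, function names, and ranking-by-argsort convention are illustrative assumptions, not code or data from the paper.

```python
import numpy as np

def copeland_scores(wins: np.ndarray) -> np.ndarray:
    """Copeland score: +1 for each opponent a model beats head-to-head,
    -1 for each it loses to, 0 for ties (diagonal ignored)."""
    n = wins.shape[0]
    scores = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if wins[i, j] > wins[j, i]:
                scores[i] += 1
            elif wins[i, j] < wins[j, i]:
                scores[i] -= 1
    return scores

def borda_scores(wins: np.ndarray) -> np.ndarray:
    """Borda-style score: total number of pairwise preferences a model receives."""
    return wins.sum(axis=1) - np.diag(wins)

# Toy preference matrix for 3 models (rows beat columns this many times).
wins = np.array([
    [0, 7, 5],
    [3, 0, 6],
    [5, 4, 0],
], dtype=float)

print("Copeland ranking (best first):", np.argsort(-copeland_scores(wins)))
print("Borda ranking (best first):   ", np.argsort(-borda_scores(wins)))
```

Both rules consume the same preference graph but can disagree when cycles or close margins are present, which is exactly the kind of divergence the framework is designed to surface.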

📝 Abstract
Ideal or real - that is the question. In this work, we explore whether principles from game theory can be effectively applied to the evaluation of large language models (LLMs). This inquiry is motivated by the growing inadequacy of conventional evaluation practices, which often rely on fixed-format tasks with reference answers and struggle to capture the nuanced, subjective, and open-ended nature of modern LLM behavior. To address these challenges, we propose a novel alternative: automatic mutual evaluation, where LLMs assess each other's output through self-play and peer review. These peer assessments are then systematically compared with human voting behavior to evaluate their alignment with human judgment. Our framework incorporates game-theoretic voting algorithms to aggregate peer reviews, enabling a principled investigation into whether model-generated rankings reflect human preferences. Empirical results reveal both convergences and divergences between theoretical predictions and human evaluations, offering valuable insights into the promises and limitations of mutual evaluation. To the best of our knowledge, this is the first work to jointly integrate mutual evaluation, game-theoretic aggregation, and human-grounded validation for evaluating the capabilities of LLMs.
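As a rough illustration of the self-play peer-review loop described in the abstract, the sketch below has every model judge every pair of the other models' answers and accumulates the results into the win-count matrix that the voting rules then aggregate. The `generate` and `judge` callables are hypothetical interfaces, and the rule that a model never judges a pair containing its own answer is an assumption, not necessarily the paper's protocol.

```python
from itertools import combinations
import numpy as np

def build_preference_matrix(models, prompts, generate, judge):
    """Run pairwise peer review: every model judges every pair of the
    other models' answers, accumulating a win-count matrix.

    Assumed interfaces (illustrative only):
      generate(model, prompt) -> answer string
      judge(judge_model, prompt, answer_a, answer_b) -> 0 if it prefers a, 1 if b
    """
    n = len(models)
    wins = np.zeros((n, n))
    for prompt in prompts:
        answers = [generate(m, prompt) for m in models]
        for i, j in combinations(range(n), 2):
            for k, judge_model in enumerate(models):
                if k in (i, j):  # assumption: a model never judges its own pair
                    continue
                preferred = judge(judge_model, prompt, answers[i], answers[j])
                winner, loser = (i, j) if preferred == 0 else (j, i)
                wins[winner, loser] += 1
    return wins
```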
Problem

Research questions and friction points this paper is trying to address.

Applying game theory to evaluate large language models
Addressing limitations of conventional LLM evaluation methods
Investigating alignment between model-generated rankings and human preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs assess each other through mutual evaluation
Game-theoretic voting algorithms aggregate peer reviews
Systematic comparison with human voting validates alignment (a minimal validation sketch follows this list)
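The human-grounded validation step can be sketched as follows, assuming the peer-review aggregation and the human votes each yield a score per model over the same candidate set; `scipy.stats.spearmanr` is one standard way to obtain the rank correlation discussed above. The score vectors here are invented solely for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical aggregate scores for the same five models:
# one vector from game-theoretic aggregation of peer reviews,
# one from human head-to-head voting totals.
model_scores = [12.0, 9.5, 8.0, 4.0, 2.5]
human_scores = [11.0, 10.0, 6.5, 5.0, 3.0]

rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```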