SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language model (LLM) evaluation methods rely heavily on domain experts, manual annotation, and labeled datasets, which limits their scalability and introduces subjectivity. Method: This paper introduces SKATE, a fully automated, human-free, label-free, and scalable evaluation framework. In SKATE, models compete by generating and solving verifiable tasks (e.g., code-output prediction) for one another, framing evaluation as a game: each model is incentivized to pose questions that highlight its own strengths while exposing its opponents' weaknesses. Match outcomes are aggregated with the TrueSkill rating system for fine-grained ability differentiation. Contribution/Results: Experiments on six frontier LLMs show that weaker models can reliably differentiate and score stronger ones, that LLM-based systems exhibit self-preferencing behavior by generating questions aligned with their own capabilities, and that SKATE automatically surfaces fine-grained capability differences, offering a general, objective, and scalable paradigm for AI evaluation without human supervision or ground-truth labels.
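The core verifiable unit in SKATE's proof of concept is the code-output-prediction (COP) challenge: the task-setter writes a code snippet, the solver predicts what it prints, and the answer is checked by executing the snippet rather than by an LLM judge. Below is a minimal sketch of that scoring step, assuming the snippets are Python and glossing over the sandboxing and prompting details the actual system would need:

```python
import subprocess
import sys

def run_snippet(code: str, timeout: float = 5.0) -> str:
    """Execute a Python snippet in a subprocess and capture its stdout as ground truth."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()

def score_cop_round(setter_snippet: str, solver_prediction: str) -> bool:
    """A COP task counts as solved iff the predicted output exactly matches the executed output."""
    return solver_prediction.strip() == run_snippet(setter_snippet)

# Illustrative round: the setter's snippet and the solver's predicted output are placeholders.
snippet = "print(sum(i * i for i in range(5)))"
print(score_cop_round(snippet, "30"))  # True: the snippet prints 30
```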

📝 Abstract
Evaluating the capabilities and risks of foundation models is paramount, yet current methods demand extensive domain expertise, hindering their scalability as these models rapidly evolve. We introduce SKATE: a novel evaluation framework in which large language models (LLMs) compete by generating and solving verifiable tasks for one another. Our core insight is to treat evaluation as a game: models act as both task-setters and solvers, incentivized to create questions which highlight their own strengths while exposing others' weaknesses. SKATE offers several key advantages, balancing scalability, open-endedness, and objectivity. It is fully automated, data-free, and scalable, requiring no human input or domain expertise. By using verifiable tasks rather than LLM judges, scoring is objective. Unlike domain-limited programmatically-generated benchmarks (e.g. chess-playing or spatial reasoning), having LLMs creatively pose challenges enables open-ended and scalable evaluation. As a proof of concept, we introduce LLM-set code-output-prediction (COP) challenges as a verifiable and extensible framework in which to test our approach. Using a TrueSkill-based ranking system, we evaluate six frontier LLMs and find that: (1) weaker models can reliably differentiate and score stronger ones, (2) LLM-based systems are capable of self-preferencing behavior, generating questions that align with their own capabilities, and (3) SKATE automatically surfaces fine-grained capability differences between models. Our findings are an important step towards general, scalable evaluation frameworks which can keep pace with LLM progress.
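The abstract's game framing implies a round-robin tournament in which every model takes both roles against every other model. A rough sketch of that loop under stated assumptions: `play_cop_round(setter, solver)` is a hypothetical helper returning True when the solver answers the setter's challenge correctly, and the model names and round count are illustrative rather than the six frontier LLMs evaluated in the paper:

```python
from collections import defaultdict
from itertools import permutations

MODELS = ["model_a", "model_b", "model_c"]  # placeholder identifiers
ROUNDS_PER_PAIR = 10  # illustrative

def run_tournament(play_cop_round):
    """Play every ordered (setter, solver) pairing, so each model both sets and solves tasks."""
    outcomes = defaultdict(list)
    for setter, solver in permutations(MODELS, 2):
        for _ in range(ROUNDS_PER_PAIR):
            outcomes[(setter, solver)].append(play_cop_round(setter, solver))
    return outcomes
```

Each entry in `outcomes` is a win/loss record for one ordered pairing, which is what a rating system can then consume.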
Problem

Research questions and friction points this paper is trying to address.

Evaluating foundation models' capabilities and risks scalably
Automating model evaluation without human input or expertise
Differentiating model strengths via verifiable, open-ended challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs compete via verifiable task generation
Automated, scalable, objective evaluation framework
TrueSkill ranking for model differentiation
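Pairwise outcomes from the setter/solver matches are aggregated into a ranking with TrueSkill. A minimal sketch using the open-source `trueskill` Python package; the conservative "mu minus three sigma" leaderboard is a common convention and may differ from the paper's exact rating configuration:

```python
import trueskill

def update_ratings(ratings, winner, loser):
    """Update two models' TrueSkill ratings after a decided match (winner listed first)."""
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])

def leaderboard(ratings):
    """Rank models by the conservative estimate mu - 3 * sigma."""
    return sorted(ratings, key=lambda m: ratings[m].mu - 3 * ratings[m].sigma, reverse=True)

# Illustrative usage with placeholder model names.
ratings = {name: trueskill.Rating() for name in ["model_a", "model_b", "model_c"]}
update_ratings(ratings, winner="model_b", loser="model_a")
update_ratings(ratings, winner="model_b", loser="model_c")
print(leaderboard(ratings))  # model_b should rank first after winning both matches
```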
Dewi S. W. Gould
The Alan Turing Institute, London, England, NW1 2DB, United Kingdom
Bruno Mlodozeniec
University of Cambridge, Cambridge, England, CB2 1TN, United Kingdom; Max Planck Institute for Intelligent Systems, Tübingen, Germany
Samuel F. Brown
Unknown affiliation
AI Safety, LLMs, AI agents