Evaluation of Large Language Models via Coupled Token Generation

๐Ÿ“… 2025-02-03
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Large language models respond stochastically, so repeated evaluations on the same prompts can yield unstable outcomes and unreliable model comparisons. To address this, the paper develops a causal model of coupled autoregressive generation, in which multiple models sample their responses from the same source of randomness, explicitly controlling stochastic variation during text generation. On benchmark-based evaluations, coupled generation reaches the same conclusions as vanilla generation with provably fewer samples (up to 40% fewer across knowledge areas of the MMLU benchmark). On evaluations based on pairwise comparisons, however, coupled and vanilla generation can produce different rankings when more than two models are compared, even with infinitely many samples, suggesting that a model's apparent advantage under existing protocols may be confounded by the randomness of the generation process; experiments with data from the LMSYS Chatbot Arena platform show that win-rates indeed differ between the two schemes.

๐Ÿ“ Abstract
State of the art large language models rely on randomization to respond to a prompt. As an immediate consequence, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples. However, we further show that, on evaluations based on (human) pairwise comparisons, coupled and vanilla autoregressive generation can surprisingly lead to different rankings when comparing more than two models, even with an infinite amount of samples. This suggests that the apparent advantage of a model over others in existing evaluation protocols may not be genuine but rather confounded by the randomness inherent to the generation process. To illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama family. We find that, across multiple knowledge areas from the popular MMLU benchmark dataset, coupled autoregressive generation requires up to 40% fewer samples to reach the same conclusions as vanilla autoregressive generation. Further, using data from the LMSYS Chatbot Arena platform, we find that the win-rates derived from pairwise comparisons of responses to prompts, as judged by a strong large language model, differ under coupled and vanilla autoregressive generation.
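The coupling idea in the abstract can be sketched with a toy example: each model draws its next token by applying inverse-CDF sampling to its own next-token distribution, but all models consume the same uniform random draw. This is a minimal illustration, not the paper's implementation; the function name, the 4-token vocabulary, and the distributions are hypothetical.

```python
import numpy as np

def sample_coupled(dist_a, dist_b, rng):
    """Sample one token from each model's next-token distribution
    using the same uniform draw (inverse-CDF coupling)."""
    u = rng.random()  # shared source of randomness for both models
    tok_a = int(np.searchsorted(np.cumsum(dist_a), u))
    tok_b = int(np.searchsorted(np.cumsum(dist_b), u))
    return tok_a, tok_b

rng = np.random.default_rng(seed=0)
# Two hypothetical next-token distributions over a 4-token vocabulary.
p_a = np.array([0.1, 0.6, 0.2, 0.1])
p_b = np.array([0.1, 0.5, 0.3, 0.1])
a, b = sample_coupled(p_a, p_b, rng)
```

Because both models invert their CDFs at the same point, two models with identical distributions always emit the same token, and nearly identical models rarely diverge, so observed differences in output reflect the models rather than the noise.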
Problem

Research questions and friction points this paper is trying to address.

Control randomization in language model evaluation
Develop causal model for coupled autoregressive generation
Compare coupled and vanilla autoregressive generation results
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coupled token generation method
Same randomness across models
Fewer samples for evaluation
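The "fewer samples" claim follows the logic of common random numbers: when both models' correctness on a prompt is driven by the same draw, the paired estimator of their accuracy gap has much lower variance. A minimal simulation sketch, with assumed per-prompt accuracies (0.72 and 0.70 are illustrative, not the paper's numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
p_a, p_b = 0.72, 0.70  # hypothetical accuracies of two models

# Coupled: correctness of both models driven by the same uniform draws.
u = rng.random(n)
diff_coupled = (u < p_a).astype(float) - (u < p_b).astype(float)

# Vanilla: each model gets independent randomness.
diff_vanilla = (rng.random(n) < p_a).astype(float) - (rng.random(n) < p_b).astype(float)

# Both estimate the same accuracy gap (about 0.02), but the coupled
# per-prompt differences have markedly smaller standard deviation,
# so fewer prompts suffice to detect the gap at the same confidence.
print(diff_coupled.std(), diff_vanilla.std())
```

The coupled differences are nonzero only on the narrow band of draws where the models disagree, which is exactly the variance reduction that lets coupled evaluation reach the same conclusion with fewer samples.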
๐Ÿ”Ž Similar Papers
No similar papers found.