When LLMs get significantly worse: A statistical approach to detect model degradations

📅 2026-02-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of reliably detecting performance degradation in large language models caused by optimization techniques such as quantization, where observed drops in accuracy may stem from genuine model deterioration or mere evaluation noise. To this end, the authors propose a statistical hypothesis testing framework based on McNemar’s test, which introduces sample-level paired comparisons for the first time in the context of LLM degradation analysis, thereby overcoming the limited sensitivity of conventional task-level aggregation. Integrated with multi-benchmark accuracy aggregation and the LM Evaluation Harness, the method effectively controls false positive rates and reliably identifies performance degradations as small as 0.3%. Empirical results demonstrate that the approach accurately flags models exhibiting true degradation while producing no false alarms for theoretically lossless optimizations.

📝 Abstract
Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods as well as others, such as quantization, that come without accuracy guarantees. In all of these cases it is essential to ensure that model quality has not degraded. However, even at temperature zero, model generations are not necessarily robust even to theoretically lossless model optimizations, due to numerical errors. We thus require statistical tools to decide whether a finite-sample accuracy deviation is evidence of a model's degradation or whether it can be attributed to (harmless) noise in the evaluation. We propose a statistically sound hypothesis testing framework based on McNemar's test that efficiently detects model degradations while guaranteeing a controlled rate of false positives. The crucial insight is that we must compare the models' scores on each individual sample, rather than aggregated at the task level. Furthermore, we propose three approaches to aggregate accuracy estimates across multiple benchmarks into a single decision. We provide an implementation on top of the widely adopted open-source LM Evaluation Harness and present a case study illustrating that the method correctly flags degraded models, while not flagging model optimizations that are provably lossless. We find that with our tests, even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.
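The abstract's key idea, pairing the two models' correctness on each sample and applying McNemar's test, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name and the one-sided exact form are assumptions, and only the discordant pairs (samples where exactly one model is correct) drive the decision.

```python
from math import comb

def mcnemar_degradation_pvalue(base_correct, opt_correct):
    """One-sided exact McNemar test for degradation.

    base_correct, opt_correct: per-sample 0/1 correctness of the
    baseline and the optimized model on the SAME evaluation samples.
    Returns the p-value for H0 "the optimized model is not worse".
    """
    # Discordant pairs: b = baseline right & optimized wrong (degradation
    # direction), c = baseline wrong & optimized right (improvement).
    b = sum(1 for x, y in zip(base_correct, opt_correct) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(base_correct, opt_correct) if x == 0 and y == 1)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any change
    # Under H0, b ~ Binomial(n, 0.5); p-value = P(X >= b).
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n
```

Because only discordant pairs enter the statistic, small but systematic flips toward "baseline right, optimized wrong" can reach significance even when the aggregate accuracy gap is a fraction of a percent, which is why the sample-level pairing is more sensitive than comparing task-level accuracies.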
Problem

Research questions and friction points this paper is trying to address.

model degradation
statistical hypothesis testing
LLM optimization
accuracy evaluation
numerical error
Innovation

Methods, ideas, or system contributions that make the work stand out.

model degradation detection
statistical hypothesis testing
McNemar's test
LLM evaluation
accuracy monitoring
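The abstract mentions three approaches for aggregating per-benchmark results into a single decision but does not detail them here. As a minimal sketch of one standard option (a Bonferroni correction, which is an assumption on my part, not necessarily one of the paper's three methods), the per-benchmark p-values from the paired test can be combined while keeping the overall false-positive rate controlled at a chosen level:

```python
def flag_degradation(pvalues, alpha=0.05):
    """Flag a model as degraded if any benchmark's p-value is significant
    after a Bonferroni correction across the m benchmarks, keeping the
    family-wise false-positive rate at most alpha."""
    m = len(pvalues)
    return any(p < alpha / m for p in pvalues)
```

The division by the number of benchmarks is what preserves the controlled false-positive rate emphasized in the abstract: testing many benchmarks at the uncorrected level would inflate the chance of a spurious flag.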