AI Summary
Current evaluations of large language models predominantly rely on static, clean benchmarks that fail to capture model robustness against real-world user inputs containing noise, typos, or semantically equivalent yet lexically diverse prompts. This work proposes a theoretical framework that decomposes model performance variance into contributions from data difficulty and prompt variation, and introduces BrittleBench, an evaluation pipeline that systematically assesses robustness by generating prompt variants through semantics-preserving textual perturbations. Experiments reveal that such perturbations can degrade model performance by up to 12%, alter model rankings in 63% of cases, and account for as much as 50% of performance variance in certain models. These findings demonstrate that state-of-the-art models are highly sensitive to minor prompt alterations, underscoring the critical need for dynamic, robustness-oriented evaluation protocols.
Abstract
Existing evaluation methods largely rely on clean, static benchmarks, which can overestimate true model performance by failing to capture the noise and variability inherent in real-world user inputs. This is especially true for language models, which face human-generated text queries containing mistakes, typos, or alternative ways of phrasing the same question. In this work, we introduce a theoretical framework for quantifying model sensitivity to prompt variants, or brittleness, which enables us to disentangle data-induced difficulty from prompt-related variability. Using this framework, we design a novel evaluation pipeline, Brittlebench, to holistically evaluate the sensitivity of frontier models. We apply semantics-preserving perturbations to a suite of popular benchmarks and observe model performance degrade by as much as 12%. However, these perturbations do not affect all models equally: even a single perturbation alters the relative ranking of models in 63% of cases, impacting conclusions about comparative model performance. Decomposing the total variance of both state-of-the-art open-weight and commercial models, we find that semantics-preserving input perturbations can account for up to half of the performance variance for a given model. Brittlebench highlights the need for more robust evaluations and models, and allows us to systematically understand model brittleness.
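The decomposition described above can be sketched with the law of total variance: over a grid of scores indexed by question and prompt variant, total variance splits into a data-difficulty term (variance of per-question means) and a prompt-variation term (mean of per-question variances). This is a minimal illustrative sketch with toy data, not the paper's exact estimator; the array shapes and names are assumptions.

```python
import numpy as np

# Hypothetical sketch: scores[i, j] is a model's score on question i
# under prompt variant j (e.g. 1.0 correct, 0.0 incorrect).
# Law of total variance (with population variance, ddof=0):
#   Var(score) = Var_q(E_v[score | q]) + E_q(Var_v[score | q])
rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(200, 8)).astype(float)  # toy data

per_question_mean = scores.mean(axis=1)      # E_v[score | q]
data_variance = per_question_mean.var()      # difficulty-induced term
prompt_variance = scores.var(axis=1).mean()  # perturbation-induced term
total_variance = scores.var()

# The two components sum to the total (up to floating-point error).
assert np.isclose(data_variance + prompt_variance, total_variance)

# Share of variance attributable to prompt variation for this model.
prompt_share = prompt_variance / total_variance
print(f"prompt-variation share of variance: {prompt_share:.2%}")
```

A model for which `prompt_share` approaches 0.5 matches the paper's finding that perturbations can account for up to half of a model's performance variance.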