AI Summary
Current evaluations of large language models predominantly rely on static, clean benchmarks that fail to capture model robustness against real-world user inputs containing noise, typos, or semantically equivalent yet lexically diverse prompts. This work proposes a theoretical framework that decomposes model performance variance into contributions from data difficulty and prompt variation, and introduces BrittleBench, an evaluation pipeline that systematically assesses robustness by generating prompt variants through semantics-preserving textual perturbations. Experiments reveal that such perturbations can degrade model performance by up to 12%, alter model rankings in 63% of cases, and account for as much as 50% of performance variance in certain models. These findings demonstrate that state-of-the-art models are highly sensitive to minor prompt alterations, underscoring the critical need for dynamic, robustness-oriented evaluation protocols.
Abstract
Existing evaluation methods largely rely on clean, static benchmarks, which can overestimate true model performance by failing to capture the noise and variability inherent in real-world user inputs. This is especially true for language models, which face human-generated text queries containing mistakes, typos, or alternative ways of phrasing the same question. In this work, we introduce a theoretical framework for quantifying model sensitivity to prompt variants, or brittleness, which enables us to disentangle data-induced difficulty from prompt-related variability. Using this framework, we design a novel evaluation pipeline, Brittlebench, to holistically evaluate the sensitivity of frontier models. We apply semantics-preserving perturbations to a suite of popular benchmarks and observe model performance degrade by as much as 12%. However, these perturbations do not affect all models equally: even a single perturbation alters the relative ranking of models in 63% of cases, impacting conclusions about comparative model performance. Decomposing the total variance of both state-of-the-art open-weight and commercial models, we find that semantics-preserving input perturbations can account for up to half of the performance variance for a given model. Brittlebench highlights the need for more robust evaluations and models, and allows us to systematically understand model brittleness.
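The decomposition described above can be sketched with the law of total variance: over a grid of scores indexed by question and prompt variant, total variance splits into a data-difficulty term (variance of per-question means) and a prompt-variation term (mean of per-question variances). This is a minimal illustrative sketch with toy data, not the paper's exact estimator; the array shapes and names are assumptions.

```python
import numpy as np

# Hypothetical sketch: scores[i, j] is a model's score on question i
# under prompt variant j (e.g. 1.0 correct, 0.0 incorrect).
# Law of total variance (with population variance, ddof=0):
#   Var(score) = Var_q(E_v[score | q]) + E_q(Var_v[score | q])
rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(200, 8)).astype(float)  # toy data

per_question_mean = scores.mean(axis=1)      # E_v[score | q]
data_variance = per_question_mean.var()      # difficulty-induced term
prompt_variance = scores.var(axis=1).mean()  # perturbation-induced term
total_variance = scores.var()

# The two components sum to the total (up to floating-point error).
assert np.isclose(data_variance + prompt_variance, total_variance)

# Share of variance attributable to prompt variation for this model.
prompt_share = prompt_variance / total_variance
print(f"prompt-variation share of variance: {prompt_share:.2%}")
```

A model for which `prompt_share` approaches 0.5 matches the paper's finding that perturbations can account for up to half of a model's performance variance.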