QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks

📅 2026-04-20

🤖 AI Summary

This work addresses the problem of efficiently identifying questions that large language models (LLMs) struggle to answer in dynamic evaluation benchmarks, where evaluation is expensive and noisy outcomes make detection unreliable. The authors propose QuickScope, which adapts COUP, a recent Bayesian optimization algorithm, to the dynamic LLM evaluation setting after several substantive modifications. QuickScope searches the question space adaptively and supports user-defined objectives, such as low accuracy or questions that are unusually hard relative to their measured complexity. The tool pairs a flexible dataset interface with configurable utility functions. In experiments across a range of dynamic benchmarks, QuickScope discovers truly difficult questions more sample-efficiently than standard baselines while reducing false positives caused by noisy outcomes.

📝 Abstract

LLM benchmarks are increasingly dynamic: instead of containing a fixed set of questions, they define templates and parameters that can generate an effectively unlimited number of question variants. This flexibility is valuable, but it makes evaluation expensive, especially when the goal is not just determining an average score, but reliably identifying a model's weak spots. This paper introduces a new methodology for identifying hard questions in dynamic benchmarks. It leverages COUP, a recent Bayesian optimization algorithm (Graham, Velez & Leyton-Brown, 2026), after introducing several substantive modifications to make the algorithm suitable for practical LLM pipelines. We also wrap it in a tool that supports flexible choices of datasets and utility functions, enabling users to target the kinds of questions they care about (e.g., low-accuracy questions; questions that are unusually hard relative to their measured complexity). In experiments across a range of benchmarks, we show that our method, dubbed $\texttt{QuickScope}$, discovers truly difficult questions more sample-efficiently than standard baselines, while also reducing false positives from noisy outcomes.
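
To make the setting concrete, here is a minimal sketch of what "certifying" a hard configuration under noise can look like. It is not COUP and not the QuickScope implementation: it swaps in plain round-robin sampling with Hoeffding confidence bounds. The multiplication template, the mock model, and every function name below are hypothetical stand-ins chosen for illustration.

```python
import math
import random


def render_question(digits, rng):
    """Instantiate one variant of a hypothetical multiplication template."""
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"What is {a} * {b}?", a * b


def mock_model_is_correct(digits, rng):
    """Stand-in for a noisy LLM call (assumption: accuracy falls with digits)."""
    _question, _answer = render_question(digits, rng)
    p_correct = max(0.05, 1.0 - 0.2 * (digits - 1))
    return rng.random() < p_correct


def low_accuracy_utility(is_correct):
    """Utility in [0, 1]: high when the model fails (one possible objective)."""
    return 0.0 if is_correct else 1.0


def radius(n, delta):
    """Hoeffding radius with a union bound over sample counts n = 1, 2, ..."""
    return math.sqrt(math.log(2.0 * n * (n + 1) / delta) / (2.0 * n))


def certify_hard(configs, threshold, budget, delta=0.05, seed=0):
    """Sample configs until each is certified hard, discarded, or budget ends."""
    rng = random.Random(seed)
    per_config_delta = delta / len(configs)  # union bound across configs
    totals = {c: 0.0 for c in configs}
    counts = {c: 0 for c in configs}
    active = list(configs)
    certified = []
    spent = 0
    while active and spent < budget:
        for c in list(active):
            if spent >= budget:
                break
            totals[c] += low_accuracy_utility(mock_model_is_correct(c, rng))
            counts[c] += 1
            spent += 1
            mean = totals[c] / counts[c]
            r = radius(counts[c], per_config_delta)
            if mean - r > threshold:
                # Lower bound clears the threshold: certified hard.
                certified.append(c)
                active.remove(c)
            elif mean + r < threshold:
                # Upper bound falls below the threshold: safely discarded.
                active.remove(c)
    return certified, counts


certified, counts = certify_hard([1, 2, 3, 4, 5], threshold=0.5, budget=5000)
print("certified hard:", certified)
print("samples per config:", counts)
```

In this sketch a configuration is reported as hard only once its lower confidence bound on mean utility exceeds the threshold, which is one simple way to limit false positives from noisy outcomes. Clearly easy configurations are dropped after a few dozen samples, so the budget concentrates on borderline ones. A different objective, such as difficulty relative to a complexity measure, would replace `low_accuracy_utility`.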

Problem

Research questions and friction points this paper is trying to address.

dynamic LLM benchmarks

hard question identification

model weakness detection

sample-efficient evaluation

noisy outcomes

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic LLM benchmarks

Bayesian optimization

hard question identification