BenchBrowser -- Collecting Evidence for Evaluating Benchmark Validity

📅 2026-02-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current language model benchmarks often carry only coarse-grained metadata, making it difficult to assess how well they cover the capabilities that matter to users. To address this limitation, this work proposes a fine-grained retrieval system that takes natural language queries and precisely identifies the evaluation items relevant to real-world usage scenarios across 20 mainstream benchmarks. The system leverages interpretable retrieval evidence to expose gaps between benchmark content and user intent, and it enables transparent validation of benchmarks through human evaluation combined with analyses of content validity and convergent validity. Human assessment confirms that the method achieves high retrieval precision and effectively uncovers issues such as insufficient capability coverage and unstable model rankings.
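
The paper does not publish BenchBrowser's retrieval architecture here, but a dense bi-encoder is one plausible reading of "fine-grained retrieval over natural language queries." The sketch below is a minimal illustration under that assumption; the encoder name, `BenchmarkItem`, and `retrieve` are all hypothetical, not BenchBrowser's API.

```python
# Minimal sketch: embedding-based retrieval of benchmark items for a
# natural-language use-case query. All identifiers are illustrative.
from dataclasses import dataclass

import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers


@dataclass
class BenchmarkItem:
    benchmark: str  # e.g., "IFEval"
    item_id: str
    prompt: str     # the evaluation item's text


def retrieve(query: str, items: list[BenchmarkItem], k: int = 5):
    """Rank benchmark items by cosine similarity to a use-case query."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
    item_vecs = model.encode([it.prompt for it in items], normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = item_vecs @ query_vec  # cosine similarity of unit vectors
    top = np.argsort(-scores)[:k]
    return [(items[i], float(scores[i])) for i in top]


# Example query from the abstract's motivating case: does a "poetry"
# benchmark actually contain haiku items?
# hits = retrieve("write a haiku about autumn", all_items, k=10)
```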
📝 Abstract
Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a "poetry" benchmark may never test for haikus, while "instruction-following" benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases over 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability's facets) and low convergent validity (lack of stable rankings when measuring the same capability). BenchBrowser thus helps quantify a critical gap between practitioner intent and what benchmarks actually test.
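
The abstract defines low convergent validity as unstable rankings when two benchmarks claim to measure the same capability. One natural way to operationalize this (an assumption on my part; the paper's exact metric is not stated here) is a rank correlation between model orderings on the two sets of retrieved items:

```python
# Hedged sketch: checking convergent validity via Spearman rank
# correlation of model rankings across two benchmarks that target
# the same capability. Scores below are made-up placeholders.
from scipy.stats import spearmanr

# Hypothetical per-model accuracies on items retrieved for the same
# use case from two different benchmark suites.
scores_bench_a = {"model1": 0.81, "model2": 0.74, "model3": 0.62, "model4": 0.55}
scores_bench_b = {"model1": 0.58, "model2": 0.79, "model3": 0.60, "model4": 0.77}

models = sorted(scores_bench_a)
rho, p_value = spearmanr(
    [scores_bench_a[m] for m in models],
    [scores_bench_b[m] for m in models],
)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A low or negative rho signals unstable rankings, i.e. weak
# convergent validity for that capability.
```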
Problem

Research questions and friction points this paper is trying to address.

benchmark validity
content validity
convergent validity
language model evaluation
practitioner intent
Innovation

Methods, ideas, or system contributions that make the work stand out.

BenchBrowser
benchmark validity
retrieval system
content validity
convergent validity