AutoBencher: Towards Declarative Benchmark Construction

📅 2024-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency and lack of targeting in benchmark construction for evaluating language model capabilities and safety. We propose AutoBencher, the first declarative framework for automatic benchmark generation. It formalizes benchmark design as a multi-objective optimization problem, jointly optimizing for difficulty, topic salience, and safety constraints, while leveraging large language models to iteratively generate and refine dataset descriptions. Key contributions include: (1) introducing the first declarative paradigm for benchmark construction; (2) enabling fine-grained, goal-driven customization; and (3) pioneering targeted discovery of tail-knowledge deficiencies and specific safety refusal failures. AutoBencher constructs novel benchmarks across mathematics, multilingual understanding, factual knowledge, and safety, eliciting 22% more model errors than existing benchmarks. It uncovers Gemini-Pro's knowledge gaps in paleontology and GPT-4o's failure to refuse cryptocurrency scam requests.

📝 Abstract
We present AutoBencher, a declarative framework for automatic benchmark construction, and use it to scalably discover novel insights and vulnerabilities of existing language models. Concretely, given a few desiderata of benchmarks (e.g., question difficulty, topic salience), we operationalize each desideratum and cast benchmark creation as an optimization problem. Specifically, we experiment with two settings with different optimization objectives: (i) for capability evaluation, we declare the goal of finding a salient, difficult dataset that induces novel performance patterns; (ii) for safety evaluation, we declare the goal of finding a dataset of unsafe prompts that existing LMs fail to decline. To tackle this optimization problem, we use a language model to iteratively propose and refine dataset descriptions, which are then used to generate topic-specific questions and answers. These descriptions are optimized to improve the declared desiderata. We use AutoBencher (powered by GPT-4) to create datasets for math, multilinguality, knowledge, and safety. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that elicit 22% more model errors (i.e., difficulty) than existing benchmarks. On the novelty end, AutoBencher also helps identify specific gaps not captured by existing benchmarks: e.g., Gemini-Pro has knowledge gaps on the Permian Extinction and Fordism, while GPT-4o fails to decline harmful requests about cryptocurrency scams.
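The propose-and-refine loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: `propose_descriptions` and `score` are hypothetical stand-ins for the GPT-4 proposer and the operationalized desiderata (difficulty, salience, novelty), and the toy scores are fabricated for the example.

```python
import random

def propose_descriptions(history, n=4):
    """Stand-in for the LM proposer. In the paper, a language model is
    prompted to suggest new dataset descriptions, conditioned on past
    (description, score) pairs; here we just draw placeholder topics."""
    topics = ["Permian Extinction", "Fordism",
              "modular arithmetic", "Kalaallisut grammar"]
    return random.sample(topics, min(n, len(topics)))

def score(description):
    """Stand-in for the declared desiderata. The paper generates
    topic-specific Q&A from each description and scores it on, e.g.,
    candidate-model error rate and topic salience; here, fixed toy values."""
    toy_scores = {"Permian Extinction": 0.9, "Fordism": 0.8,
                  "modular arithmetic": 0.5, "Kalaallisut grammar": 0.7}
    return toy_scores.get(description, 0.0)

def autobencher_loop(iterations=3):
    """Iteratively propose dataset descriptions, score each against the
    desiderata, and keep the best-scoring description found so far."""
    history = []                 # (description, score) pairs seen so far
    best = (None, -1.0)
    for _ in range(iterations):
        for desc in propose_descriptions(history):
            s = score(desc)
            history.append((desc, s))
            if s > best[1]:
                best = (desc, s)
    return best

best_desc, best_score = autobencher_loop()
print(best_desc, best_score)
```

In the real system, the history of scored descriptions is fed back into the proposer's prompt so that later proposals concentrate on difficult, salient topics rather than being sampled blindly.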
Problem

Research questions and friction points this paper is trying to address.

Benchmark construction for language model evaluation is manual, inefficient, and poorly targeted.
Existing benchmarks are not optimized to surface novel performance patterns or safety vulnerabilities.
Static benchmarks are often too easy to expose fine-grained capability gaps and tail-knowledge deficiencies.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Declarative framework for automatic benchmark construction
Benchmark creation cast as an optimization problem, solved by iteratively proposing and refining dataset descriptions with a language model
Scalable testing of fine-grained categories and tail knowledge