AI Summary
Existing benchmarks for evaluating language models suffer from overly abstract criteria, coarse granularity, and coverage bias. To address these limitations, we propose BiGGen Bench, the first generative evaluation benchmark targeting nine fine-grained capabilities (e.g., reasoning consistency, factual controllability) across 77 diverse tasks. Our method introduces instance-level dynamic evaluation criteria and an LM-as-evaluator paradigm, enabling capability disentanglement and balanced assessment through collaborative scoring by multiple evaluator LMs, task-aware prompt engineering, and a structured evaluation protocol. We further develop an extensible, reproducible, and fully open-source automated evaluation framework. A comprehensive evaluation of 103 state-of-the-art models reveals critical capability bottlenecks across dimensions. All code, data, and results are publicly released.
Abstract
As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most current generation benchmarks assess LMs using abstract evaluation criteria such as helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench.