AI Summary
Existing benchmarks for evaluating language models suffer from overly abstract criteria, coarse granularity, and coverage bias. To address these limitations, we propose BiGGen Bench, the first generative evaluation benchmark targeting nine fine-grained capabilities (e.g., reasoning consistency, factual controllability) across 77 diverse tasks. Our method introduces instance-level dynamic evaluation criteria and an LM-as-evaluator paradigm, enabling capability disentanglement and balanced assessment through collaborative scoring by multiple evaluator LMs, task-aware prompt engineering, and a structured evaluation protocol. We further develop an extensible, reproducible, and fully open-source automated evaluation framework. A comprehensive evaluation of 103 state-of-the-art models reveals critical capability bottlenecks across dimensions. All code, data, and results are publicly released.
Abstract
As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most current generation benchmarks assess LMs using abstract evaluation criteria such as helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench.