The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

📅 2024-06-09
🏛️ arXiv.org
📈 Citations: 20
✨ Influential: 1
🤖 AI Summary
Existing language model evaluation benchmarks suffer from overly abstract criteria, coarse granularity, and coverage bias. To address these limitations, we propose BiGGen Bench, a generative evaluation benchmark targeting nine fine-grained capabilities (e.g., reasoning, planning, tool usage) across 77 diverse tasks. Our method introduces instance-specific evaluation criteria and an LM-as-judge paradigm, disentangling capabilities and balancing assessment through scoring by multiple evaluator LMs, task-aware prompt design, and a structured evaluation protocol. We further develop an extensible, reproducible, and fully open-source automated evaluation framework. A comprehensive evaluation of 103 state-of-the-art models reveals critical capability bottlenecks across dimensions. All code, data, and results are publicly released.
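
The mechanism behind these criteria, grading each response against a rubric written for that particular test instance rather than a generic yardstick such as helpfulness, can be sketched in a few lines. The snippet below is a minimal illustration, not the released BiGGen Bench harness: the judge prompt is simplified, and the injected evaluator callable is a hypothetical stand-in for any evaluator-LM completion API (see the repository linked in the abstract for the actual implementation).

from typing import Callable

# Simplified judge prompt; the real BiGGen Bench prompts differ.
JUDGE_PROMPT = """You are a fair judge. Assess the response strictly
against the score rubric written for this specific instance.

### Instruction:
{instruction}

### Response to evaluate:
{response}

### Score rubric (instance-specific):
{rubric}

Write brief feedback, then end with the line "Score: <1-5>"."""

def score_instance(
    evaluator: Callable[[str], str],  # any text-in, text-out LM call
    instruction: str,
    response: str,
    rubric: str,
) -> int:
    """Grade one response on a 1-5 scale using its per-instance rubric."""
    feedback = evaluator(
        JUDGE_PROMPT.format(
            instruction=instruction, response=response, rubric=rubric
        )
    )
    # Parse the trailing "Score: N" line emitted by the evaluator LM.
    return int(feedback.rsplit("Score:", 1)[-1].strip()[0])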

📝 Abstract
As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench.
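
Because scores come from five different evaluator LMs, per-response grades must be combined across judges. Below is a minimal sketch that reuses score_instance from above and assumes simple averaging as the aggregation rule; the paper's own aggregation may differ.

from statistics import mean
from typing import Callable, Sequence

def aggregate_scores(
    evaluators: Sequence[Callable[[str], str]],  # e.g., five judge LMs
    instruction: str,
    response: str,
    rubric: str,
) -> float:
    """Average one response's 1-5 grades over several evaluator LMs."""
    return mean(
        score_instance(ev, instruction, response, rubric)
        for ev in evaluators
    )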
Problem

Research questions and friction points this paper is trying to address.

Evaluating LMs with granular, human-like criteria
Addressing coverage bias in current LM benchmarks
Assessing nine distinct LM capabilities across 77 tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces BiGGen Bench for fine-grained LM evaluation
Uses instance-specific criteria mirroring human assessment
Evaluates 103 LMs across 77 diverse tasks
👥 Authors

Seungone Kim
Carnegie Mellon University
Large Language Models, Natural Language Processing

Juyoung Suk
KAIST
Large Language Models

Ji Yong Cho
LG AI Research

Shayne Longpre
MIT, Stanford, Apple
Deep Learning, Natural Language Understanding

Chaeeun Kim
LBOX

Dongkeun Yoon
KAIST

Guijin Son
Undergraduate, Yonsei University
Natural Language Processing, Large Language Models

Yejin Cho
KAIST

Sheikh Shafayat
KAIST

Jinheon Baek
Ph.D. student, KAIST
Machine Learning, Natural Language Processing, RAG

Sue Hyun Park
AI Research Center, KRAFTON Inc.
Large Language Models, Alignment

Hyeonbin Hwang
KAIST
Large Language Models, Reasoning

Jinkyung Jo
KAIST

Hyowon Cho
KAIST

Haebin Shin
KAIST
Machine Learning, Natural Language Processing

Seongyun Lee
KAIST AI
NLP, LLM, Multimodal

Hanseok Oh
Mila
Natural Language Processing, Information Retrieval, Machine Learning, Agent

Noah Lee
KAIST AI
LLM, Alignment

Namgyu Ho
PhD student at KAIST
large language models, reasoning, inference efficiency, adaptive computation

Se June Joo
KAIST

Miyoung Ko
KAIST
Natural Language Processing, Machine Learning

Yoonjoo Lee
KAIST
Human Computer Interaction, Natural Language Processing

Hyungjoo Chae
Georgia Institute of Technology
GUI Agent, Digital Agent, LLM Agent

Jamin Shin
KAIST

Joel Jang
Research Scientist, Nvidia

Seonghyeon Ye
KAIST
Machine Learning, Robot Learning

Bill Yuchen Lin
Affiliate Assistant Professor, University of Washington
Natural Language Processing, Machine Learning, Large Language Models, AI

Sean Welleck
Carnegie Mellon University

Graham Neubig
Carnegie Mellon University, All Hands AI
Natural Language Processing, Machine Learning, Artificial Intelligence

Moontae Lee
Head of Superintelligence Lab at LG AI Research | Assistant Professor at the University of Illinois
Large Language Models | Foundation Models | World Models

Kyungjae Lee
LG AI Research

Minjoon Seo
Config Intelligence; KAIST
Artificial Intelligence, Language Modeling