CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists

📅 2024-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-as-a-Judge approaches rely on subjective Likert-scale scoring, which leads to low cross-model rating agreement, high score variance, and weakened correlation with human judgments. This paper proposes CheckEval, a checklist-based evaluation framework that replaces subjective scale scoring with decomposed binary (yes/no) checklist questions. Experiments with 12 evaluator models across multiple datasets show that CheckEval improves average correlation with human judgments by 0.10, raises cross-model rating agreement by 0.45, and substantially reduces score variance. Because each judgment is tied to a verifiable checklist item, CheckEval also makes evaluation outcomes more interpretable and traceable.

📝 Abstract
Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments, improving the average correlation with human judgments by 0.10. More importantly, CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces the score variance. CheckEval scores furthermore have the benefit of being more interpretable, because the framework decomposes evaluation criteria into traceable binary decisions, allowing analyses of the specific attributes driving quality judgments.
Problem

Research questions and friction points this paper is trying to address.

Addresses rating inconsistencies in LLM-as-a-Judge text evaluation.
Improves reliability using checklist-based binary questions.
Enhances agreement and reduces variance across evaluator models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Checklist-based framework improves rating reliability.
Decomposed binary questions enhance evaluator agreement.
Traceable binary decisions increase score interpretability.
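The aggregation idea described above can be sketched in a few lines: each evaluation criterion is decomposed into binary checklist questions, and the judge's yes/no answers are combined into a single score. This is a minimal illustrative sketch, not the paper's implementation; the `ChecklistItem` type, `checkeval_score` function, and the example questions are all hypothetical.

```python
# Hypothetical sketch of checklist-based binary scoring (names are
# illustrative, not from the paper's code).
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    question: str   # one binary question probing a single quality aspect
    answer: bool    # the judge model's yes/no decision

def checkeval_score(items: list[ChecklistItem]) -> float:
    """Aggregate binary decisions into a score in [0, 1]:
    the fraction of checklist questions answered 'yes'."""
    if not items:
        raise ValueError("checklist must contain at least one item")
    return sum(item.answer for item in items) / len(items)

# Example: an assumed coherence checklist for a generated summary.
coherence = [
    ChecklistItem("Does the summary follow a logical order?", True),
    ChecklistItem("Are all sentences on-topic?", True),
    ChecklistItem("Is the summary free of contradictions?", False),
]
score = checkeval_score(coherence)  # fraction of 'yes' answers: 2/3
```

Because each item is a traceable yes/no decision, a low score can be attributed to the specific questions that failed, which is what gives the checklist scores their interpretability.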
Yukyung Lee
Korea University, Seoul, Republic of Korea
Joonghoon Kim
Korea University, Seoul, Republic of Korea
Jaehee Kim
Korea University, Seoul, Republic of Korea
Hyowon Cho
KAIST, Seoul, Republic of Korea
Pilsung Kang
Korea University, Seoul, Republic of Korea
Jaewook Kang
Najoung Kim
Boston University
Cognitive Science, Linguistics, Computational Linguistics