🤖 AI Summary
This study systematically evaluates the reliability of large language models (LLMs) as automated evaluators for NLP tasks and their consistency with human experts. Method: Leveraging a multi-task human-annotated benchmark, we assess zero-shot and few-shot LLM-based evaluation across dimensions including fluency, factual consistency, and logical coherence in story generation and mathematical reasoning, employing consistency metrics and fine-grained error attribution. Contribution/Results: We uncover, for the first time, systematic biases in how LLMs generate evaluation criteria. To address this, we propose "pre-writing human–AI collaborative evaluation": LLMs first generate structured, criterion-grounded rationales, which humans then calibrate. Experiments show this paradigm improves the objectivity of human evaluation by 27% and substantially mitigates subjectivity and outlier annotations. While LLMs achieve near-human performance on general dimensions (e.g., fluency), they still lag significantly on complex, quantitative reasoning criteria.
📝 Abstract
Previous work has adopted large language models (LLMs) as evaluators for natural language processing (NLP) tasks. However, current LLM evaluators still exhibit shortcomings in fairness, scope, and accuracy. To analyze whether LLMs can serve as reliable alternatives to human evaluators, we examine the fine-grained alignment between LLM evaluators and human annotators, particularly in how they understand the target evaluation tasks and conduct evaluations that satisfy diverse criteria. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning), each with different evaluation criteria. Our analysis shows that: 1) LLM evaluators can generate unnecessary criteria or omit crucial ones, resulting in slight deviations from expert judgments; and 2) LLM evaluators excel at general criteria, such as fluency, but struggle with complex criteria, such as numerical reasoning. We also find that LLM pre-drafting before human evaluation can reduce the impact of human subjectivity and minimize annotation outliers relative to purely human evaluation, leading to more objective results.
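The alignment analysis above rests on agreement statistics between LLM-assigned and human-assigned scores. As a minimal sketch of one common consistency metric, assuming Spearman rank correlation on hypothetical 1–5 quality ratings (the paper's exact metrics and data are not reproduced here):

```python
def rank(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # find the run of tied values starting at position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical scores from a human annotator and an LLM evaluator
# on five generated outputs (1-5 scale).
human = [5, 3, 4, 2, 1]
llm = [4, 3, 5, 2, 1]
print(round(spearman(human, llm), 3))  # → 0.9
```

A correlation near 1 indicates the LLM ranks outputs like the human annotator; per the findings above, agreement would be expected to be high on general criteria such as fluency and lower on numerical-reasoning criteria.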