Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

๐Ÿ“… 2024-10-24
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 8
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Widespread label noise (10–25%) in NLP benchmark datasets leads to systematic underestimation of model performance: many purported "LLM failures" are attributable to annotation errors rather than model limitations. Method: LLM-as-a-judge, a framework that combines ensemble judgments from GPT-4, Claude, and Llama with consistency voting and error-sensitivity analysis to automatically detect mislabeled instances; label smoothing and confident learning are then applied for robust label recalibration. Contribution/Results: a comprehensive evaluation across the TRUE benchmark suite reveals substantial disparities in quality and efficiency among expert, crowdsourced, and LLM-generated annotations. After correction, state-of-the-art models gain 3.2–7.8 accuracy percentage points on average. This work provides the first empirical evidence of systematic label-noise interference in LLM evaluation and introduces a scalable, collaborative adjudication paradigm that reframes data correction as model-performance recalibration.
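The ensemble detection step described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact pipeline: the judge interface, the agreement threshold, and the stub judges standing in for GPT-4, Claude, and Llama are all hypothetical.

```python
from collections import Counter

def flag_label_errors(examples, judges, min_agreement=3):
    """Flag examples whose gold label disagrees with a consensus of
    LLM judges. Each judge is a callable mapping an example's text to
    a predicted label (a hypothetical interface for illustration)."""
    flagged = []
    for ex in examples:
        # Tally the judges' votes and take the majority label.
        votes = Counter(judge(ex["text"]) for judge in judges)
        consensus, count = votes.most_common(1)[0]
        # Flag only when the ensemble agrees strongly AND disagrees
        # with the existing gold label.
        if count >= min_agreement and consensus != ex["label"]:
            flagged.append({**ex, "suggested": consensus})
    return flagged

# Toy usage: three stub judges that always answer "entailment".
judges = [lambda text: "entailment" for _ in range(3)]
data = [
    {"text": "example a", "label": "contradiction"},  # likely mislabeled
    {"text": "example b", "label": "entailment"},     # consistent
]
print(flag_label_errors(data, judges))
```

In practice each judge call would be an API request with a task-specific prompt; the consistency threshold trades recall of label errors against false flags.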

๐Ÿ“ Abstract
NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. Through a case study of four datasets from the TRUE benchmark, covering different tasks and domains, we empirically analyze the labeling quality of existing datasets, and compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs' so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate them in training to improve model performance.
Problem

Research questions and friction points this paper is trying to address.

Detecting label errors in NLP benchmark datasets
Comparing annotation quality from experts, crowdsourcing, and LLMs
Mitigating mislabeled data effects on model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging LLM ensemble to detect label errors
Comparing expert, crowd-sourced, and LLM annotation quality
Correcting label errors to improve model performance
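On the mitigation side, the summary mentions label smoothing as one way to train robustly under residual label noise. A minimal sketch of standard label smoothing follows; the epsilon value is an illustrative choice, not a setting from the paper.

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: soften one-hot targets so the loss is less
    sensitive to individual mislabeled examples. Mass (1 - epsilon)
    stays on the annotated class; epsilon is spread uniformly."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / n_classes

# A binary one-hot target [0, 1] becomes [0.05, 0.95].
y = np.array([[0.0, 1.0]])
print(smooth_labels(y))
```

Confident learning, the other technique named in the summary, instead estimates the joint distribution of noisy and true labels to prune or reweight suspect examples rather than softening all targets uniformly.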
๐Ÿ”Ž Similar Papers
No similar papers found.