Are We Done with MMLU?

📅 2024-06-06
🏛️ arXiv.org
📈 Citations: 13
Influential: 1
🤖 AI Summary
The MMLU benchmark contains systematic annotation errors (6.49% of questions overall, rising to 57% in the Virology subset) that distort evaluations of model capability. Method: the authors introduce a framework for identifying dataset errors, built around a novel error annotation protocol, and manually re-annotate a sample of questions from every subject. Contribution/Results: they quantify MMLU's error distribution and release MMLU-Redux, a manually re-annotated subset of 5,700 questions spanning all 57 subjects. Evaluation on MMLU-Redux reveals significant discrepancies with originally reported model performance, indicating that error-ridden questions in the original benchmark misrepresent model capability. The dataset is publicly released, supporting more reliable and trustworthy LLM evaluation.

📝 Abstract
Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error annotation protocol. Then, we create MMLU-Redux, which is a subset of 5,700 manually re-annotated questions across all 57 MMLU subjects. We estimate that 6.49% of MMLU questions contain errors. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0.
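The headline figures (6.49% errors overall, 57% in Virology) are per-subset error rates over manually re-annotated questions. As an illustration of that computation, here is a minimal, self-contained sketch; the label names and the toy data are hypothetical and are not the paper's actual annotation schema:

```python
# Hypothetical per-question annotation labels (illustrative only):
# "ok" marks a clean question; any other label flags an error such as
# a wrong ground-truth answer or a question with no correct option.
annotations = {
    "virology": ["wrong_ground_truth", "ok", "no_correct_answer", "ok",
                 "multiple_correct_answers", "ok", "wrong_ground_truth"],
    "astronomy": ["ok", "ok", "ok", "ok", "wrong_ground_truth"],
}

def error_rate(labels):
    """Fraction of questions whose annotation is anything other than 'ok'."""
    errors = sum(1 for label in labels if label != "ok")
    return errors / len(labels)

for subject, labels in annotations.items():
    print(f"{subject}: {error_rate(labels):.1%} of {len(labels)} questions flagged")
```

With this toy data the Virology-style subset comes out at 4/7 ≈ 57.1% flagged, mirroring the shape of the paper's reported statistic (not its actual data).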
Problem

Research questions and friction points this paper is trying to address.

MMLU Test Errors
Language Model Assessment
Reliable Dataset Creation
Innovation

Methods, ideas, or system contributions that make the work stand out.

MMLU-Redux
Error Correction
Language Model Evaluation