To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
The increasing prevalence of objective errors, such as incorrect formulas, flawed derivations, and erroneous figures, in AI research papers has lacked systematic, quantitative assessment. Method: We propose AI Checker, the first automated detection framework built on GPT-5, augmented by expert validation, to identify verifiable errors and generate correction suggestions in top-tier conference and journal publications. Our approach integrates large language model–based reasoning with formal verification logic to enhance both detection efficiency and interpretability. Results: Analysis of papers from 2018–2025 reveals a rising trend in average objective errors per paper, increasing from 3.8 (NeurIPS 2021) to 5.9 (NeurIPS 2025). AI Checker achieves a precision of 83.2% and produces actionable corrections for 75.8% of detected errors. This work is the first to empirically quantify and characterize the growth of objective errors in the AI literature, establishing a scalable technical paradigm and an evidence-based foundation for improving scientific reproducibility and quality assurance.

📝 Abstract
How many mistakes do published AI papers contain? Peer-reviewed publications form the foundation upon which new research and knowledge are built. Errors that persist in the literature can propagate unnoticed, creating confusion in follow-up studies and complicating reproducibility. The accelerating pace of research and the increasing demands on the peer-review system make such mistakes harder to detect and avoid. To address this, we developed a Paper Correctness Checker based on GPT-5 to systematically identify mistakes in papers previously published at top AI conferences and journals. Our analysis focuses on objective mistakes (e.g., errors in formulas, derivations, calculations, figures, and tables) that have a clearly verifiable ground truth. We intentionally exclude subjective considerations such as novelty, importance, or writing quality. We find that published papers contain a non-negligible number of objective mistakes and that the average number of mistakes per paper has increased over time: from 3.8 in NeurIPS 2021 to 5.9 in NeurIPS 2025 (55.3% increase); from 4.1 in ICLR 2018 to 5.2 in ICLR 2025; and from 5.0 in TMLR 2022/23 to 5.5 in TMLR 2025. Human experts reviewed 316 potential mistakes identified by the AI Checker and confirmed that 263 were actual mistakes, corresponding to a precision of 83.2%. While most identified issues are relatively minor, correcting them would reduce confusion in the literature and strengthen reproducibility. The AI Checker also surfaced potentially more substantive mistakes that could affect the interpretation of results. Moreover, we show that the AI Checker can propose correct fixes for 75.8% of the identified mistakes. Overall, this study highlights the potential of frontier LLMs to detect and correct objective mistakes in published papers, helping to establish a firmer foundation of knowledge.
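The headline figures above follow from simple arithmetic; a minimal sketch reproducing them (the function names are illustrative, not from the paper):

```python
# Illustrative arithmetic only; all input numbers come from the abstract.

def pct_increase(old: float, new: float) -> float:
    """Percentage increase from old to new."""
    return (new - old) / old * 100

def precision(confirmed: int, flagged: int) -> float:
    """Fraction of flagged items confirmed as true mistakes, as a percentage."""
    return confirmed / flagged * 100

# NeurIPS: 3.8 mistakes/paper (2021) -> 5.9 (2025)
neurips_rise = pct_increase(3.8, 5.9)      # ≈ 55.3%
# 263 of 316 flagged mistakes confirmed by human experts
checker_precision = precision(263, 316)    # ≈ 83.2%

print(f"NeurIPS 2021→2025 increase: {neurips_rise:.1f}%")
print(f"AI Checker precision: {checker_precision:.1f}%")
```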
Problem

Research questions and friction points this paper is trying to address.

Systematically identify objective mistakes in published AI papers
Quantify increasing error rates in top AI conferences over time
Develop an AI tool to detect and correct errors automatically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using GPT-5 to systematically identify objective errors in AI papers
Focusing on verifiable mistakes in formulas, calculations, figures, and tables
Proposing correct fixes for a majority of the detected errors