When AI Co-Scientists Fail: SPOT, a Benchmark for Automated Verification of Scientific Research

📅 2025-05-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit low recall, poor precision, and irreproducible outputs in automated verification of academic papers. Method: We introduce SPOT—the first empirical benchmark for scientific credibility verification—built upon 83 peer-reviewed papers and 91 expert-annotated critical scientific errors (many leading to corrigenda or retractions), validated via multi-round LLM reasoning (e.g., o3), human cross-verification, and domain-expert qualitative analysis. Contribution/Results: Our systematic evaluation reveals that state-of-the-art LLMs achieve at most 21.1% recall and 6.1% precision on SPOT, suffering from conceptual misunderstandings, miscalibrated confidence, and low error reproducibility. SPOT fills a critical gap in evaluation benchmarks for scholarly integrity and empirically demonstrates that current LLMs lack reliable capability for academic verification—providing an essential foundation for developing trustworthy AI-assisted research tools.

📝 Abstract
Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work casts these systems as generative co-authors responsible for crafting hypotheses, synthesizing code, or drafting manuscripts. In this work, we explore a complementary application: using LLMs as verifiers to automate the academic verification of scientific manuscripts. To that end, we introduce SPOT, a dataset of 83 published papers paired with 91 errors significant enough to prompt errata or retraction, cross-validated with actual authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find that none surpasses 21.1% recall or 6.1% precision (o3 achieves the best scores, with all others near zero). Furthermore, confidence estimates are uniformly low, and across eight independent runs, models rarely rediscover the same errors, undermining their reliability. Finally, qualitative analysis with domain experts reveals that even the strongest models make mistakes resembling student-level misconceptions derived from misunderstandings. These findings highlight the substantial gap between current LLM capabilities and the requirements for dependable AI-assisted academic verification.
Problem

Research questions and friction points this paper is trying to address.

Automating academic verification of scientific manuscripts using LLMs
Evaluating LLMs' performance in detecting significant scientific errors
Assessing reliability and precision of AI in academic error detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs as verifiers for academic manuscripts
SPOT dataset with real paper errors
Evaluating LLM recall and precision rates
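The recall and precision figures follow the standard definitions for error detection: recall is the fraction of expert-annotated errors a model rediscovers, and precision is the fraction of the model's flagged errors that are genuine. A minimal sketch of this scoring, where the matching logic and toy counts are illustrative assumptions rather than the paper's actual pipeline:

```python
def score(predicted_errors, annotated_errors):
    """Recall/precision for a set of predicted errors against
    ground-truth annotations, assuming both sides use shared
    error identifiers (a simplifying assumption)."""
    true_positives = len(predicted_errors & annotated_errors)
    recall = true_positives / len(annotated_errors) if annotated_errors else 0.0
    precision = true_positives / len(predicted_errors) if predicted_errors else 0.0
    return recall, precision

# Toy counts loosely mirroring SPOT's scale: 91 annotated errors,
# 300 model-flagged candidates, 19 of them matching a real error.
annotated = set(range(91))
predicted = set(range(19)) | {f"fp{i}" for i in range(281)}
r, p = score(predicted, annotated)  # r = 19/91, p = 19/300
```

Under these definitions, a model that flags many candidates can still score near zero on both axes if its flags rarely coincide with annotated errors, which is the regime the paper reports for most evaluated LLMs.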