CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the critical issue of "superficially plausible yet unsupported claims" in reviews of scientific papers generated by large language models (LLMs). To study claim grounding, i.e. whether the weaknesses raised in a review are anchored in claims the paper actually makes, the authors introduce CLAIMCHECK, the first benchmark dedicated to evaluating this alignment. Built on NeurIPS 2023/2024 submissions and their official reviews mined from OpenReview, CLAIMCHECK provides fine-grained expert annotations that map review weaknesses to the paper claims they dispute and label each weakness's validity, objectivity, and type. The benchmark supports three tasks: weakness–claim association, weakness labeling and rewriting, and claim verification. The methodology combines OpenReview data mining, expert human annotation, and multi-task LLM evaluation (matching, classification, generation, and reasoning-based verification). Experiments show that while state-of-the-art LLMs can predict weakness labels reasonably well, they significantly underperform human experts on the association and verification tasks, highlighting claim grounding as a fundamental bottleneck in automating scientific peer review.
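To make the annotation structure concrete, here is a minimal sketch of what a CLAIMCHECK-style record might look like. The field names and label values below are hypothetical illustrations based on the summary's description (validity, objectivity, type), not the benchmark's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a CLAIMCHECK-style annotation record.
# Field names and label values are illustrative assumptions,
# not the benchmark's published schema.
@dataclass
class WeaknessAnnotation:
    weakness_text: str          # weakness statement taken from the review
    disputed_claims: list[str]  # paper claims this weakness disputes
    validity: str               # e.g. "valid" / "invalid"
    objectivity: str            # e.g. "objective" / "subjective"
    weakness_type: str          # e.g. "methodology", "novelty"

example = WeaknessAnnotation(
    weakness_text="The ablation study omits the proposed regularization term.",
    disputed_claims=["Our regularizer is the main driver of the reported gains."],
    validity="valid",
    objectivity="objective",
    weakness_type="methodology",
)

# Task (1), weakness-claim association, asks a model to recover
# `disputed_claims` given the weakness and the paper's claims;
# task (2) asks it to predict the three labels above.
```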

📝 Abstract
A core part of scientific peer review involves providing expert critiques that directly assess the scientific claims a paper makes. While it is now possible to automatically generate plausible (if generic) reviews, ensuring that these reviews are sound and grounded in the papers' claims remains challenging. To facilitate LLM benchmarking on these challenges, we introduce CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. CLAIMCHECK is richly annotated by ML experts for weakness statements in the reviews and the paper claims that they dispute, as well as fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark several LLMs on three claim-centric tasks supported by CLAIMCHECK, requiring models to (1) associate weaknesses with the claims they dispute, (2) predict fine-grained labels for weaknesses and rewrite the weaknesses to enhance their specificity, and (3) verify a paper's claims with grounded reasoning. Our experiments reveal that cutting-edge LLMs, while capable of predicting weakness labels in (2), continue to underperform relative to human experts on all other tasks.
Problem

Research questions and friction points this paper is trying to address.

Assess grounding of LLM critiques in scientific claims
Benchmark LLMs on claim-centric review tasks
Evaluate LLM performance versus human experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing CLAIMCHECK dataset for LLM benchmarking
Annotating weaknesses and disputed claims in reviews
Benchmarking LLMs on claim-centric review tasks