Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim → Evidence Reasoning

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit limited capability in identifying and validating claim–evidence logical relationships within scientific papers, hindering deep scientific argument understanding. Method: We introduce CLAIM-BENCH, the first benchmark specifically designed for evaluating LLMs' comprehension of scientific argumentation. We propose three-pass and claim-wise (one-by-one) prompting strategies, integrating divide-and-conquer prompt engineering, cross-paragraph multi-step reasoning, and structured output parsing. Evaluation is conducted systematically across six state-of-the-art LLMs, including GPT-4, Claude, and representative open-source models. Contribution/Results: Our findings reveal substantial limitations of current LLMs in complex scientific reasoning. Closed-source models consistently outperform open-source counterparts in both precision and recall. The optimal prompting strategy improves evidence-linking accuracy by up to 32%, underscoring the critical role of prompt design in enhancing scientific reasoning capabilities. CLAIM-BENCH thus establishes a rigorous foundation for advancing LLM-based scientific argument analysis.
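The summary names the prompting strategies but the page carries no code. Below is a minimal sketch of how a three-pass, claim-wise pipeline could look, assuming a hypothetical `complete()` wrapper around whatever chat-LLM API is in use; the prompts and the JSON schema are illustrative assumptions, not the benchmark's actual prompts.

```python
import json

def complete(prompt: str) -> str:
    """Placeholder for any chat-LLM call (OpenAI, Anthropic, a local model).
    Assumed to return the model's raw text response."""
    raise NotImplementedError

def three_pass_extract(paper_text: str) -> list[dict]:
    # Pass 1: identify candidate claims across the whole paper.
    claims = json.loads(complete(
        "List every scientific claim made in the paper below as a JSON "
        "array of strings.\n\n" + paper_text
    ))

    # Pass 2 (claim-wise / one-by-one): for each claim, gather supporting
    # evidence, which may be dispersed across paragraphs.
    pairs = []
    for claim in claims:
        evidence = json.loads(complete(
            "Quote every passage from the paper below that supports the "
            "following claim. Return a JSON array of strings.\n\n"
            f"Claim: {claim}\n\nPaper:\n{paper_text}"
        ))
        pairs.append({"claim": claim, "evidence": evidence})

    # Pass 3: validate each claim-evidence link with a structured verdict.
    for pair in pairs:
        pair["validation"] = json.loads(complete(
            "Does the evidence logically support the claim? Answer with "
            'JSON of the form {"supported": true, "reason": "..."}.\n\n'
            + json.dumps(pair)
        ))
    return pairs
```

The per-claim loop in pass 2 is what "one-by-one" prompting refers to, and it is also where the increased computational cost noted in the abstract comes from.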

📝 Abstract
Large language models (LLMs) are increasingly being used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence, remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs' capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three divide-and-conquer-inspired prompting approaches across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through an evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal significant limitations in LLMs' ability to process complex scientific content. Our results demonstrate that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall across claim-evidence identification tasks. Furthermore, strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs' ability to accurately link dispersed evidence with claims, although at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path forward for building systems capable of deeper, more reliable reasoning across full-length papers.
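The abstract reports precision and recall over claim-evidence pairs. Below is a minimal sketch of such scoring, assuming exact matching on whitespace- and case-normalized strings; the benchmark's actual matching criterion is not given on this page.

```python
def normalize(text: str) -> str:
    # Crude normalization; the benchmark's real matching may be fuzzier.
    return " ".join(text.lower().split())

def pair_scores(predicted: list[tuple[str, str]],
                gold: list[tuple[str, str]]) -> tuple[float, float]:
    """Precision and recall over (claim, evidence) pairs."""
    pred_set = {(normalize(c), normalize(e)) for c, e in predicted}
    gold_set = {(normalize(c), normalize(e)) for c, e in gold}
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    return precision, recall
```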
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to extract and validate scientific claim-evidence pairs
Comparing performance of six LLMs in scientific comprehension tasks
Assessing limitations and improvements in linking evidence to claims
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLAIM-BENCH benchmark for claim-evidence evaluation
Three evaluation approaches inspired by divide-and-conquer prompting
Three-pass and one-by-one prompting strategies with structured output parsing (see the sketch after this list)
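Structured output parsing is one of the listed ingredients: model responses often wrap JSON in prose or markdown code fences, so extraction is usually best-effort. A minimal sketch follows; `parse_json_block` is a hypothetical helper, not code from the paper.

```python
import json
import re

def parse_json_block(response: str):
    """Best-effort extraction of a JSON value from an LLM response."""
    # Prefer the contents of a markdown code fence if one is present.
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", response, re.DOTALL)
    if fenced:
        response = fenced.group(1)
    # Otherwise fall back to the outermost bracket/brace span.
    starts = [i for i in (response.find("["), response.find("{")) if i != -1]
    if not starts:
        raise ValueError("no JSON value found in response")
    end = max(response.rfind("]"), response.rfind("}")) + 1
    return json.loads(response[min(starts):end])
```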