🤖 AI Summary
This work investigates large language models' (LLMs) capabilities in evidence-based claim verification, specifically evaluating deductive versus abductive reasoning. To this end, we introduce RECV, the first benchmark featuring real-world claims with fine-grained, atomic-level annotations of reasoning types, and propose a reasoning-type decomposition evaluation framework. Through systematic assessment of mainstream closed-source LLMs across multiple difficulty levels and prompting strategies, complemented by semantic similarity analysis, we find that: (1) LLMs exhibit robust performance on deductive reasoning but suffer from systematic failures in abductive reasoning; (2) generated explanations achieve high semantic similarity to human-written ones, especially for deductive tasks, but rationale generation does not consistently improve verification accuracy. This study provides the first empirical evidence of LLMs' fundamental limitations in abductive reasoning, establishing a trustworthy benchmark and methodology for rigorous reasoning evaluation.
📝 Abstract
Although LLMs have shown strong performance on mathematics- and coding-related reasoning tasks, their capabilities in other forms of reasoning remain an open problem. Here, we examine this issue from the perspective of claim verification. We propose a framework designed to break down any claim paired with evidence into the atomic reasoning types necessary for verification. We use this framework to create Reasoning in Evidence-based Claim Verification (RECV), the first claim verification benchmark built on real-world claims that assesses the deductive and abductive reasoning capabilities of LLMs. The benchmark comprises three datasets covering reasoning problems of increasing complexity. We evaluate three state-of-the-art proprietary LLMs under multiple prompt settings. Our results show that while LLMs can address deductive reasoning problems, they consistently fail in cases of abductive reasoning. Moreover, we observe that enhancing LLMs with rationale generation is not always beneficial. Nonetheless, we find that generated rationales are semantically similar to those provided by humans, especially in deductive reasoning cases.