🤖 AI Summary
In claim verification, self-improvement methods suffer from contamination by erroneous reasoning: low-quality reasoning chains can coincidentally match binary truth labels, degrading performance. To address this, we propose a structured reasoning framework specifically designed for claim verification, comprising three stages: claim decomposition, entity analysis, and evidence-grounded verification. We introduce a multi-granularity supervision signal design, incorporating structural correctness constraints into the self-improvement paradigm. Our approach integrates structured prompt engineering, phased fine-tuning (warm-up followed by self-improvement), and a dual-filtering mechanism based on both reasoning-chain quality and answer consistency. Evaluated on the HOVER datasets, our method achieves a 31.4% improvement over the base model and outperforms standard chain-of-thought approaches by 20.7%, significantly enhancing reasoning reliability and verification accuracy.
📝 Abstract
Claim verification is the task of determining whether a claim is supported or refuted by evidence. Self-improvement methods, where reasoning chains are generated and those leading to correct results are selected for training, have succeeded in tasks like mathematical problem solving. In claim verification, however, this approach struggles: low-quality reasoning chains may falsely match binary truth labels, introducing faulty reasoning into the self-improvement process and ultimately degrading performance. To address this, we propose STRIVE: Structured Reasoning for Self-Improved Verification. Our method introduces a structured reasoning design with Claim Decomposition, Entity Analysis, and Evidence Grounding Verification. These components improve reasoning quality, reduce errors, and provide additional supervision signals for self-improvement. STRIVE begins with a warm-up phase, in which the base model is fine-tuned on a small number of annotated examples to learn the structured reasoning design. The warmed-up model is then used to generate reasoning chains for all training examples, and only those that are both correct and structurally sound are selected for subsequent self-improvement training. We demonstrate that STRIVE achieves significant improvements over baseline models, with a 31.4% performance gain over the base model and 20.7% over Chain of Thought on the HOVER datasets, highlighting its effectiveness.
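The dual-filtering selection step described above can be sketched as follows. This is a minimal illustration, not the paper's released code: the section markers, verdict format, and function names are hypothetical assumptions; the key idea is that a generated chain is kept for self-improvement training only if it is structurally sound (all three reasoning stages present, in order) and its final verdict matches the gold label.

```python
import re

# Hypothetical markers for STRIVE's three structured-reasoning stages
# (illustrative; the actual prompt format is defined in the paper).
REQUIRED_SECTIONS = (
    "Claim Decomposition",
    "Entity Analysis",
    "Evidence Grounding Verification",
)

def is_structurally_sound(chain: str) -> bool:
    """Check that all three stages appear in the chain, in the expected order."""
    positions = [chain.find(section) for section in REQUIRED_SECTIONS]
    return all(p >= 0 for p in positions) and positions == sorted(positions)

def extract_verdict(chain: str):
    """Pull the last SUPPORTED/REFUTED verdict from the chain, if any."""
    matches = re.findall(r"\b(SUPPORTED|REFUTED)\b", chain)
    return matches[-1] if matches else None

def select_for_training(samples):
    """Dual filter: keep chains that are structurally sound AND label-consistent."""
    return [
        chain
        for chain, gold_label in samples
        if is_structurally_sound(chain) and extract_verdict(chain) == gold_label
    ]
```

Filtering on structure in addition to the binary label is what distinguishes this from plain rejection sampling: a chain that reaches the right answer through unstructured or degenerate reasoning is discarded rather than fed back into training.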