Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization

📅 2026-05-02
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge of hallucination and logical inconsistency in multi-hop fact verification, which often arises from large language models' failure to explicitly model causal dependencies between evidence and claims. The study introduces structural causal models (SCMs) into this task for the first time, reframing multi-hop reasoning as a causal inference problem. To dynamically balance the depth and conciseness of reasoning chains, the authors propose a Group Relative Policy Optimization (GRPO) mechanism that integrates rule-based reinforcement learning. Empirical results reveal an inverted U-shaped relationship between reasoning chain length and accuracy, with the proposed approach significantly outperforming state-of-the-art methods on both the HoVer and EX-FEVER benchmarks while simultaneously enhancing model accuracy and interpretability.
📝 Abstract
Multi-Hop Fact Verification (MHFV) necessitates complex reasoning across disparate evidence, posing significant challenges for Large Language Models (LLMs) which often suffer from hallucinations and fractured logical chains. Existing methods, while improving transparency via Chain-of-Thought (CoT), lack explicit modeling of the causal dependencies between evidence and claims. In this work, we introduce a novel framework that grounds reasoning in a Structural Causal Model (SCM), treating verification as a constructive causal inference process. We empirically identify an "inverted U-shaped" correlation between reasoning chain length and accuracy, revealing that excessive structural complexity degrades performance. To address this, we propose a Rule-based Reinforcement Learning strategy using Group Relative Policy Optimization (GRPO). This approach dynamically optimizes the trade-off between structural depth and conciseness. Extensive experiments on HoVer and EX-FEVER demonstrate that our SCM-GRPO framework significantly outperforms state-of-the-art baselines, offering a reliable and interpretable solution for complex fact verification.
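The abstract does not spell out the reward rules or the training loop, so the following is only a minimal sketch of the general idea behind GRPO: several reasoning chains are sampled for the same claim, each is scored by a rule, and each score is normalized against the group's mean and standard deviation. The reward function, its `target_steps` and `length_penalty` parameters, the label strings, and the sample data are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def rule_based_reward(predicted_label, gold_label, num_steps,
                      target_steps=3, length_penalty=0.1):
    """Hypothetical rule: 1.0 for a correct verdict, minus a penalty that
    grows as the chain length moves away from an assumed target length."""
    correctness = 1.0 if predicted_label == gold_label else 0.0
    return correctness - length_penalty * abs(num_steps - target_steps)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (mean and population
    standard deviation), which is the group-relative step in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group: four sampled chains for a single claim, given as
# (predicted label, number of reasoning steps). Illustrative values only.
samples = [
    ("SUPPORTED", 3),
    ("SUPPORTED", 7),
    ("NOT_SUPPORTED", 2),
    ("SUPPORTED", 4),
]
gold = "SUPPORTED"

rewards = [rule_based_reward(label, gold, steps) for label, steps in samples]
advantages = group_relative_advantages(rewards)

for (label, steps), r, a in zip(samples, rewards, advantages):
    print(f"{label:>13} steps={steps} reward={r:+.2f} advantage={a:+.2f}")
```

Under these assumed rules, a correct but overly long chain receives a lower advantage than a correct chain near the target length, which is one way a length-sensitive reward could discourage the excessive structural complexity that the inverted U-shaped finding describes.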
Problem

Research questions and friction points this paper is trying to address.

Multi-Hop Fact Verification
Large Language Models
Causal Dependencies
Hallucinations
Logical Chains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structural Causal Model
Multi-Hop Fact Verification
Group Relative Policy Optimization
Causal Reasoning
Reinforcement Learning
Authors
Yunhan Bu
School of Computer Science and Technology, Xinjiang University, Urumqi, China
Quan Zhang
Military Science Information Research Center, Academy of Military Science, Beijing, China
Huaping Zhang
Beijing Institute of Technology
Natural Language Processing, Big Data Analysis, Large Language Processing, Psychological Analysis
Guotong Geng
Military Science Information Research Center, Academy of Military Science, Beijing, China
Chunxiao Gao
Beijing Institute of Technology, Beijing, China
Askar Hamdulla
School of Computer Science and Technology, Xinjiang University, Urumqi, China
Juan Wang
Beijing Institute of Technology, Beijing, China
Qiuchi Li
University of Copenhagen
Information Retrieval, Natural Language Processing, Machine Learning
Baohua Zhang
Beijing Institute of Technology, Beijing, China
Shuai Lei
Military Science Information Research Center, Academy of Military Science, Beijing, China
Yunbo Cao
Tencent Corporation
Natural Language Processing, Dialogue Systems, Knowledge Mining
Zhunchen Luo
Unknown affiliation