SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a challenge in multi-hop question answering: large language models often produce seemingly correct answers that mask unsupported or erroneous reasoning steps, distorting evaluation. To tackle this, the authors propose SAFE, a framework built on a two-stage dynamic verification mechanism. During training, it constructs an atomic-level error taxonomy grounded in knowledge graphs to filter noisy supervision signals. At inference time, a feedback model detects and rectifies unreliable reasoning steps in real time, yielding fully verifiable reasoning traces. The result is the first dual-phase verifiable system tailored to multi-hop reasoning, identifying up to 14% of samples on standard benchmarks as unanswerable and improving average reasoning accuracy by 8.4 percentage points.
📝 Abstract
Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.
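The two-phase design in the abstract can be sketched in outline. The following is a minimal, hypothetical illustration, not the authors' implementation: all names, data structures, and the toy knowledge graph are assumptions. It treats a reasoning trace as a sequence of entity triples, filters training examples whose traces contain KG-unsupported steps (train-time verification), and checks each step at inference time, invoking a feedback hook to repair ungrounded steps or declaring the query unanswerable (inference-time verification).

```python
# Hypothetical sketch of SAFE-style two-phase verification (not the authors' code).
# A reasoning trace is a sequence of (head, relation, tail) triples.

KG = {  # toy knowledge graph of grounded facts
    ("Paris", "capital_of", "France"),
    ("France", "located_in", "Europe"),
}

def verify_step(step):
    """Atomic check: a step is grounded iff its triple appears in the KG."""
    return step in KG

def filter_training_examples(examples):
    """Train-time verification: drop examples with any ungrounded step."""
    return [ex for ex in examples if all(verify_step(s) for s in ex["trace"])]

def answer_with_feedback(trace, correct):
    """Inference-time verification: repair or reject ungrounded steps in real time."""
    verified = []
    for step in trace:
        if not verify_step(step):
            step = correct(step)          # feedback model proposes a fix
            if step is None or not verify_step(step):
                return None               # unanswerable: no verifiable trajectory
        verified.append(step)
    return verified

# Usage: a trace with one ungrounded step, and a trivial stand-in corrector.
trace = [("Paris", "capital_of", "France"), ("France", "located_in", "Asia")]
fix = lambda s: ("France", "located_in", "Europe") if s[0] == "France" else None
print(answer_with_feedback(trace, fix))
```

The sketch mirrors the paper's split: the same atomic check backs both phases, but only the inference-time path carries a correction loop, so a trajectory is emitted only when every step is verifiable.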
Problem

Research questions and friction points this paper is trying to address.

Multi-hop Reasoning
Error Correction
Grounded Reasoning
Benchmarking
Chain-of-Thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-hop reasoning
error correction
knowledge graph grounding
verifiable reasoning
dynamic benchmarking
Daeyong Kwon
Seoul National University, South Korea
Soyoung Yoon
Seoul National University, South Korea
Seung-won Hwang
Seoul National University, South Korea