🤖 AI Summary
This work addresses a challenge in multi-hop question answering: large language models often produce seemingly correct answers that mask unsupported or erroneous reasoning steps, distorting evaluation. To tackle this issue, the authors propose the SAFE framework, which establishes a two-stage dynamic verifiability mechanism. During training, it builds an atomic-level error taxonomy grounded in knowledge graphs to filter noisy supervision signals. At inference time, a feedback model detects and rectifies unreliable reasoning steps in real time, producing fully verifiable reasoning traces. The result is the first dual-phase verifiable system tailored to multi-hop reasoning, identifying up to 14% of samples in standard benchmarks as unanswerable and improving average reasoning accuracy by 8.4 percentage points.
📝 Abstract
Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real time. Experimental results demonstrate that SAFE not only exposes critical flaws in existing benchmarks at train time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference time.
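To make the idea of KG-grounded verification concrete, the sketch below checks each step of a multi-hop reasoning trace against a knowledge graph of (head, relation, tail) triples, flagging any step not grounded in the graph. This is a minimal illustration under assumed data structures, not the paper's actual pipeline; the triple format, `verify_trace` helper, and example facts are all hypothetical.

```python
# Hypothetical sketch: verify each reasoning step against a KG of triples.
# The KG contents, triple layout, and function names are illustrative
# assumptions, not SAFE's actual implementation.

KG = {
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
}

def verify_trace(trace):
    """Return (is_grounded, ungrounded_steps).

    A trace is grounded only if every step is a triple present in the KG;
    otherwise the offending steps are returned for feedback/correction.
    """
    ungrounded = [step for step in trace if step not in KG]
    return (len(ungrounded) == 0, ungrounded)

# Two-hop trace for "In which country's capital was Marie Curie born?"
good_trace = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
]

# Same question, but with a hallucinated intermediate step.
bad_trace = [
    ("Marie Curie", "born_in", "Paris"),
    ("Paris", "capital_of", "France"),
]

ok, _ = verify_trace(good_trace)          # grounded: every step is in the KG
flagged_ok, flagged = verify_trace(bad_trace)  # ungrounded steps are surfaced
```

In SAFE's terms, the train-time phase would use such checks to discard noisy supervision, while the inference-time feedback model would react to the flagged steps rather than a simple set lookup.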