🤖 AI Summary
This work addresses the diagnosis of evidence gaps in multi-hop question answering, where individual reasoning steps can lack supporting evidence. The authors propose StepGap, a decomposable hybrid NLI-LLM decision tree that checks each reasoning step and assigns one of three actionable evidence-gap labels: Contradicted Claim, Irrelevant Evidence, or Missing Bridge. Every stage of the tree hurts F1 when removed, whereas three of four removals from an LLM-only baseline improve F1, indicating that the baseline's internal stages mask one another's errors. The work also shows that question-level F1 is mechanically inflated by checkers that flag every step, which makes step-level F1 the necessary diagnostic. The labels are further used as a typed process reward in GRPO reinforcement learning. On 82 questions with 181 annotated steps, the approach reaches a step-level F1 of 72.0, within the bootstrap confidence interval of the LLM-only baseline (70.1). As a reward signal, it raises the Exact Match of Qwen2.5-7B-Instruct from 32.1 to 35.4 (+3.3 points) across three seeds, and a single-run comparison shows a +5.6 average EM gain over the matched Search-R1 GRPO reproduction.
📝 Abstract
We present \textbf{StepGap}, a hybrid NLI-LLM decision tree that detects step-level evidence gaps in multi-hop QA and emits one of three typed labels: \textsc{Contradicted Claim} (CC), \textsc{Irrelevant Evidence} (IE), or \textsc{Missing Bridge} (MB), each tied to a concrete repair action. On 82 multi-hop questions (181 annotated steps, $\kappa{=}0.704$), StepGap reaches sF1$=$72.0, within the bootstrap confidence interval of an LLM-only baseline (70.1) but with a more decomposable structure: every StepGap stage \emph{hurts} F1 when removed, while three of four LLM-only removals \emph{improve} F1 -- a sign of \emph{competing-error cancellation}, where internal stages mask each other's errors. We further expose a \emph{Q-F1 trap}: question-level F1 is mechanically inflated by checkers that flag every step, making step-level F1 the necessary diagnostic. Used as a typed GRPO process reward, StepGap improves Qwen2.5-7B-Instruct Exact Match from $32.1{\pm}0.3$ to $35.4{\pm}0.9$ across three seeds, with the single-run comparison showing a $+5.6$ Avg EM gain over the matched Search-R1 GRPO reproduction.
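The Q-F1 trap can be illustrated with a small, self-contained sketch. It is not the authors' evaluation code. It assumes a binary gap/no-gap view that ignores the CC/IE/MB types, and it assumes that a question counts as flagged when any of its steps is flagged; the abstract does not spell out either definition. The toy data and the function names `f1`, `step_f1`, and `question_f1` are hypothetical.

```python
from typing import List


def f1(gold: List[bool], pred: List[bool]) -> float:
    """Binary F1 over parallel lists of gold and predicted flags."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def step_f1(gold: List[List[bool]], pred: List[List[bool]]) -> float:
    """Step-level F1: every step of every question is scored on its own."""
    flat_gold = [s for q in gold for s in q]
    flat_pred = [s for q in pred for s in q]
    return f1(flat_gold, flat_pred)


def question_f1(gold: List[List[bool]], pred: List[List[bool]]) -> float:
    """Question-level F1 (assumed definition): flagged if any step is flagged."""
    return f1([any(q) for q in gold], [any(q) for q in pred])


# Hypothetical toy data: 4 questions with 3 steps each.
# Three questions contain exactly one gap step; one question has none.
gold = [
    [True, False, False],
    [False, True, False],
    [False, False, True],
    [False, False, False],
]

# A degenerate checker that flags every step as an evidence gap.
flag_all = [[True] * len(q) for q in gold]

step_score = step_f1(gold, flag_all)          # precision 3/12, recall 1
question_score = question_f1(gold, flag_all)  # precision 3/4, recall 1

print(f"step-level F1:     {step_score:.3f}")
print(f"question-level F1: {question_score:.3f}")
```

Under these assumptions the flag-everything checker has perfect recall at both levels, but its precision is much higher at the question level because most questions contain at least one gap. The question-level score therefore rewards a checker that conveys no information about which step is at fault, which is the effect step-level F1 is meant to expose.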