CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses prevalent issues in multi-hop question answering, such as reasoning collapse, inconsistency between answers and reasoning traces, and loss of output-format control, where models often produce correct answers without a reliable reasoning process. To this end, the authors propose CRAFT, a novel framework that, for the first time, integrates deterministic structural rewards with judge-based semantic faithfulness rewards. Leveraging Group Relative Policy Optimization (GRPO), a reinforcement learning approach, CRAFT jointly optimizes answer accuracy, structural correctness, and the semantic faithfulness of reasoning trajectories. The framework enables controllable generation of reasoning paths and facilitates systematic analysis of how reasoning structure and scale affect performance. Evaluated on three multi-hop QA benchmarks, CRAFT-7B substantially improves both answer accuracy and reasoning faithfulness, achieving performance comparable to that of proprietary large language models.

📝 Abstract
Retrieval-augmented generation (RAG) is widely used to ground Large Language Models (LLMs) for multi-hop question answering. Recent work has mainly focused on improving answer accuracy via fine-tuning and structured or reinforcement-based optimization. However, reliable reasoning in response generation faces three challenges: 1) Reasoning collapse. Reasoning in multi-hop QA is inherently complex due to multi-hop composition and is further destabilized by noisy retrieval. 2) Reasoning-answer inconsistency. Due to the intrinsic uncertainty of LLM generation and exposure to evidence-distractor mixtures, models may produce correct answers that are not faithfully supported by their intermediate reasoning or evidence. 3) Loss of format control. Traditional chain-of-thought generation often deviates from required structured output formats, leading to incomplete or malformed structured content. To address these challenges, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a Group Relative Policy Optimization (GRPO) based reinforcement learning framework that trains models to perform faithful reasoning during response generation. CRAFT employs dual reward mechanisms to optimize multi-hop reasoning: deterministic rewards ensure structural correctness while judge-based rewards verify semantic faithfulness. This optimization framework supports controllable trace variants that enable systematic analysis of how structure and scale affect reasoning performance and faithfulness. Experiments on three multi-hop QA benchmarks show that CRAFT improves both answer accuracy and reasoning faithfulness across model scales, with the CRAFT 7B model achieving performance competitive with closed-source LLMs across multiple reasoning trace settings.
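The dual-reward idea in the abstract can be sketched in a few lines: a deterministic check that the trace follows a required output schema, a judge score for whether the answer is supported by the reasoning, and group-relative normalization of rewards in the style of GRPO. This is a minimal illustrative sketch, not the paper's implementation; the `<think>/<answer>` tag schema, the stub judge, and the 0.5/0.5 reward weighting are all assumptions.

```python
import re
import statistics

def structural_reward(trace: str) -> float:
    """Deterministic reward: 1.0 if the trace matches an assumed
    <think>...</think><answer>...</answer> format, else 0.0."""
    pattern = r"<think>.+</think>\s*<answer>.+</answer>"
    return 1.0 if re.fullmatch(pattern, trace.strip(), flags=re.DOTALL) else 0.0

def faithfulness_reward(trace: str, judge) -> float:
    """Judge-based reward: an external judge (e.g. an LLM) scores in [0, 1]
    whether the final answer is supported by the intermediate reasoning."""
    return judge(trace)

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize rewards within one sampled group
    of rollouts for the same question (reward minus group mean, over std)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Toy usage with a stub judge that favors traces citing evidence.
judge = lambda t: 1.0 if "evidence" in t else 0.3
group = [
    "<think>hop1 -> hop2, evidence cited</think><answer>Paris</answer>",
    "no structure at all",
]
rewards = [0.5 * structural_reward(t) + 0.5 * faithfulness_reward(t, judge)
           for t in group]
advantages = group_relative_advantages(rewards)
```

The well-formed, evidence-citing rollout receives a positive advantage and the malformed one a negative advantage, so the policy update pushes probability mass toward structurally correct, faithful traces.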
Problem

Research questions and friction points this paper is trying to address.

multi-hop question answering
reasoning collapse
reasoning-answer inconsistency
format control
retrieval-augmented generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

CRAFT
reinforcement learning
reasoning faithfulness
multi-hop question answering
structured reasoning
Yu Liu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Wenxiao Zhang
University of Western Australia
LLM, Robotics, Embodied AI, Security
Cong Cao
Institute of Information Engineering, Chinese Academy of Sciences
Fangfang Yuan
Institute of Information Engineering, Chinese Academy of Sciences
Weizhuo Chen
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Cheng Hu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Pin Xu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Yuling Yang
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Kun Peng
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Diandian Guo
The Chinese University of Hong Kong
Deep learning
Qiang Sun
PhD Candidate at The University of Western Australia
Knowledge Graph, Graph Embedding, Graph RAG, Graph Reasoning
Yanbing Liu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Jin B. Hong
The University of Western Australia
Cybersecurity, Moving Target Defense, Privacy
Zhiyuan Ma
University of Science and Technology of China
Knowledge reasoning