CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses prevalent issues in multi-hop question answering, such as reasoning collapse, inconsistency between answers and reasoning traces, and loss of output-format control, where models often produce correct answers without a reliable reasoning process. To this end, the authors propose CRAFT, a novel framework that, for the first time, integrates deterministic structural rewards with judge-based semantic faithfulness rewards. Leveraging Group Relative Policy Optimization (GRPO), a reinforcement learning approach, CRAFT jointly optimizes answer accuracy, structural correctness, and the semantic faithfulness of reasoning trajectories. The framework enables controllable generation of reasoning paths and facilitates systematic analysis of how reasoning structure and scale affect performance. Evaluated on three multi-hop QA benchmarks, CRAFT-7B substantially improves both answer accuracy and reasoning faithfulness, achieving performance comparable to that of proprietary large language models.

📝 Abstract
Retrieval-augmented generation (RAG) is widely used to ground Large Language Models (LLMs) for multi-hop question answering. Recent work has mainly focused on improving answer accuracy via fine-tuning and structured or reinforcement-based optimization. However, reliable reasoning in response generation faces three challenges: 1) Reasoning collapse. Reasoning in multi-hop QA is inherently complex due to multi-hop composition and is further destabilized by noisy retrieval. 2) Reasoning-answer inconsistency. Due to the intrinsic uncertainty of LLM generation and exposure to evidence-distractor mixtures, models may produce correct answers that are not faithfully supported by their intermediate reasoning or evidence. 3) Loss of format control. Traditional chain-of-thought generation often deviates from required structured output formats, leading to incomplete or malformed structured content. To address these challenges, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a Group Relative Policy Optimization (GRPO) based reinforcement learning framework that trains models to perform faithful reasoning during response generation. CRAFT employs dual reward mechanisms to optimize multi-hop reasoning: deterministic rewards ensure structural correctness while judge-based rewards verify semantic faithfulness. This optimization framework supports controllable trace variants that enable systematic analysis of how structure and scale affect reasoning performance and faithfulness. Experiments on three multi-hop QA benchmarks show that CRAFT improves both answer accuracy and reasoning faithfulness across model scales, with the CRAFT 7B model achieving performance competitive with closed-source LLMs across multiple reasoning trace settings.
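The dual-reward idea in the abstract can be sketched in a few lines: a deterministic check that the trace follows a required output schema, a judge score for whether the answer is supported by the reasoning, and group-relative normalization of rewards in the style of GRPO. This is a minimal illustrative sketch, not the paper's implementation; the `<think>/<answer>` tag schema, the stub judge, and the 0.5/0.5 reward weighting are all assumptions.

```python
import re
import statistics

def structural_reward(trace: str) -> float:
    """Deterministic reward: 1.0 if the trace matches an assumed
    <think>...</think><answer>...</answer> format, else 0.0."""
    pattern = r"<think>.+</think>\s*<answer>.+</answer>"
    return 1.0 if re.fullmatch(pattern, trace.strip(), flags=re.DOTALL) else 0.0

def faithfulness_reward(trace: str, judge) -> float:
    """Judge-based reward: an external judge (e.g. an LLM) scores in [0, 1]
    whether the final answer is supported by the intermediate reasoning."""
    return judge(trace)

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize rewards within one sampled group
    of rollouts for the same question (reward minus group mean, over std)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Toy usage with a stub judge that favors traces citing evidence.
judge = lambda t: 1.0 if "evidence" in t else 0.3
group = [
    "<think>hop1 -> hop2, evidence cited</think><answer>Paris</answer>",
    "no structure at all",
]
rewards = [0.5 * structural_reward(t) + 0.5 * faithfulness_reward(t, judge)
           for t in group]
advantages = group_relative_advantages(rewards)
```

The well-formed, evidence-citing rollout receives a positive advantage and the malformed one a negative advantage, so the policy update pushes probability mass toward structurally correct, faithful traces.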
Problem

Research questions and friction points this paper is trying to address.

multi-hop question answering
reasoning collapse
reasoning-answer inconsistency
format control
retrieval-augmented generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

CRAFT
reinforcement learning
reasoning faithfulness
multi-hop question answering
structured reasoning
Yu Liu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Wenxiao Zhang
University of Western Australia
LLM, Robotics, Embodied AI, Security
Cong Cao
Institute of Information Engineering, Chinese Academy of Sciences
Fangfang Yuan
Institute of Information Engineering, Chinese Academy of Sciences
Weizhuo Chen
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Cheng Hu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Pin Xu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Yuling Yang
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Kun Peng
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Diandian Guo
The Chinese University of Hong Kong
Deep learning
Qiang Sun
PhD Candidate at The University of Western Australia
Knowledge Graph, Graph Embedding, Graph RAG, Graph Reasoning
Yanbing Liu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Jin B. Hong
The University of Western Australia
Cybersecurity, Moving Target Defense, Privacy
Zhiyuan Ma
University of Science and Technology of China
Knowledge reasoning