Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of factual hallucinations in the intermediate reasoning steps of small language models in resource-constrained settings, where conventional outcome-based reinforcement learning can erroneously reinforce unfaithful reasoning paths whenever the final answer happens to be correct. To mitigate this issue, the authors propose FaithRL, which introduces step-level faithfulness supervision into reinforcement learning. FaithRL employs a process reward model to deliver explicit step-level faithfulness rewards and incorporates a truncation-based resampling strategy that generates implicit contrastive signals, guiding the model to learn from faithful reasoning prefixes. Experimental results demonstrate that FaithRL significantly reduces hallucinations in both reasoning chains and final answers across multiple small language models and open-domain question answering benchmarks, enhancing the faithfulness and reliability of model reasoning.
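The summary's core idea, combining an outcome reward with explicit per-step faithfulness scores from a process reward model, can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `step_level_reward`, `prm_score`, and the weight `alpha` are assumed names and hyperparameters.

```python
def step_level_reward(steps, final_correct, prm_score, alpha=0.5):
    """Blend per-step faithfulness scores with an outcome reward (sketch).

    steps: list of reasoning-step strings
    final_correct: whether the final answer matches the reference
    prm_score: callable mapping a step to a faithfulness score in [0, 1]
    alpha: illustrative weight on the outcome reward (an assumption)
    """
    faith = [prm_score(s) for s in steps]  # explicit step-level rewards
    mean_faith = sum(faith) / len(faith) if faith else 0.0
    outcome = 1.0 if final_correct else 0.0
    # A correct final answer alone no longer earns full reward when
    # intermediate steps are scored as unfaithful.
    return alpha * outcome + (1 - alpha) * mean_faith

# Toy usage: two faithful steps, one hallucinated step, correct final answer.
scores = {"step1": 0.9, "step2": 0.8, "step3": 0.1}
r = step_level_reward(list(scores), True, scores.get)
```

Under this toy scoring, the hallucinated `step3` drags the reward below the maximum even though the final answer is correct, which is the failure mode the summary says outcome-only rewards cannot penalize.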

📝 Abstract
As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.
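The abstract's "implicit truncated resampling strategy" can be sketched as truncating a chain-of-thought at the first step the process reward model scores as unfaithful, then resampling continuations from the remaining faithful prefix to form contrastive signals. This is a minimal sketch under assumed names (`truncated_resample`, `generate`, the `threshold` cutoff), not the paper's implementation.

```python
def truncated_resample(steps, prm_score, generate, threshold=0.5, k=4):
    """Return (faithful_prefix, k resampled continuations) — a sketch.

    steps: original chain-of-thought as a list of step strings
    prm_score: callable mapping a step to a faithfulness score in [0, 1]
    generate: callable sampling a continuation from a prefix (assumed)
    threshold: illustrative faithfulness cutoff (an assumption)
    """
    prefix = []
    for s in steps:
        if prm_score(s) < threshold:
            break  # truncate at the first unfaithful step
        prefix.append(s)
    # Resample k continuations from the faithful prefix; paired with the
    # original unfaithful suffix, these serve as implicit contrastive signals.
    continuations = [generate(prefix) for _ in range(k)]
    return prefix, continuations
```

A training loop would then favor the resampled faithful continuations over the original unfaithful suffix, so the model learns to extend faithful prefixes rather than being rewarded only when the final answer is correct.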
Problem

Research questions and friction points this paper is trying to address.

faithfulness hallucinations
small reasoning models
chain-of-thought
step-level reasoning
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

step-level reinforcement learning
faithfulness-aware
hallucination mitigation
chain-of-thought reasoning
small reasoning models
Shuo Nie
Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China; Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd
Hexuan Deng
Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China; Zhongguancun Academy, Beijing, China
Chao Wang
Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd
Ruiyu Fang
Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd
Xuebo Liu
Associate Professor of Computer Science, Harbin Institute of Technology, Shenzhen
Large Language Models · Natural Language Processing · Machine Translation
Shuangyong Song
Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China
Yu Li
College of Integrated Circuits, Zhejiang University, Hangzhou, Zhejiang, China
Min Zhang
Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd