On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation

📅 2024-06-18
📈 Citations: 2
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address hallucination in large language models (LLMs), this paper proposes RLFH, an on-policy reinforcement learning framework that replaces the human annotation of RLHF with a "policy-as-judge" self-evaluation paradigm. Responses are decomposed into atomic facts, which are then assessed for truthfulness and informativeness against external knowledge sources; these fine-grained evaluations are automatically mapped to token-level dense rewards, enabling online RL alignment without manual labeling. Key contributions include: (i) the first policy self-evaluation mechanism for LLMs; (ii) the first end-to-end mapping from statement-level factual feedback to token-level reward signals; and (iii) overcoming the limitations of offline fine-tuning and coarse-grained reward modeling. Evaluated on the HotpotQA, SQuADv2, and Biography benchmarks, RLFH significantly reduces hallucination rates and improves factuality by 12.7% while preserving generation fluency and information completeness.
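
The summary's central mechanism, converting statement-level fact judgments into token-level dense rewards, can be pictured with a short Python sketch. This is a minimal illustration under assumptions rather than the authors' implementation: the FactJudgment fields, the reward magnitudes, and the rule of crediting the last token of each fact's span are all placeholders.

# Sketch (assumed, not the paper's code): project per-fact verdicts onto tokens.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FactJudgment:
    start_char: int      # character span of the atomic fact inside the response
    end_char: int
    truthful: bool       # verdict from checking the fact against external knowledge
    informative: bool    # whether the fact carries real information

def token_level_rewards(token_spans: List[Tuple[int, int]],
                        judgments: List[FactJudgment],
                        r_true: float = 1.0,
                        r_false: float = -1.0,
                        r_uninformative: float = -0.2) -> List[float]:
    """One reward per token: the last token of each judged fact receives that
    fact's score; all other tokens receive 0. The values are illustrative only."""
    rewards = [0.0] * len(token_spans)
    for fact in judgments:
        score = r_false if not fact.truthful else (
            r_true if fact.informative else r_uninformative)
        inside = [i for i, (s, e) in enumerate(token_spans)
                  if s >= fact.start_char and e <= fact.end_char]
        if inside:
            rewards[inside[-1]] += score
    return rewards

An equally plausible variant would spread each fact's reward evenly over all of its tokens; the sketch only fixes one concrete choice for illustration.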

๐Ÿ“ Abstract
Hallucination occurs when large language models exhibit behavior that deviates from the boundaries of their knowledge during response generation. To address this critical issue, previous learning-based methods attempt to fine-tune models but are limited by off-policy sampling and coarse-grained feedback. In this paper, we present Reinforcement Learning for Hallucination (RLFH), an on-policy self-alignment approach that enables LLMs to actively explore their knowledge boundaries and self-correct generation behavior through fine-grained feedback signals. RLFH introduces a self-assessment framework in which the policy serves as its own judge. Through this framework, responses are automatically decomposed into atomic facts, and their truthfulness and informativeness are assessed against external knowledge sources. The resulting fine-grained, statement-level feedback is then converted into token-level dense reward signals. This enables online reinforcement learning to achieve precise and timely optimization without human intervention. Comprehensive evaluations on the HotpotQA, SQuADv2, and Biography benchmarks validate RLFH's effectiveness in hallucination mitigation.
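
The abstract's on-policy loop (sample a response, let the policy judge its own atomic facts against external evidence, convert the verdicts into dense rewards, update online) might look roughly like the sketch below. Every interface here, including policy.generate, judge, retrieve, reward_fn, and rl_update, is a hypothetical placeholder rather than the paper's actual API.

# Hedged sketch of one on-policy self-alignment step; all callables are assumed.
def rlfh_step(policy, prompts, retrieve, judge, reward_fn, rl_update):
    rollouts = []
    for prompt in prompts:
        response = policy.generate(prompt)                # on-policy sample from the current policy
        facts = judge.decompose(policy, response)         # policy-as-judge: split the answer into atomic facts
        evidence = retrieve(prompt)                       # query the external knowledge source
        verdicts = judge.verify(policy, facts, evidence)  # truthfulness and informativeness per fact
        rewards = reward_fn(response, verdicts)           # statement-level verdicts -> token-level dense rewards
        rollouts.append((prompt, response, rewards))
    rl_update(policy, rollouts)                           # online RL update, e.g. a PPO-style step

Because sampling, judging, and updating all use the current policy, the feedback stays on-policy, which is the property the abstract contrasts with earlier off-policy fine-tuning approaches.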
Problem

Research questions and friction points this paper is trying to address.

Mitigating hallucination in language models
On-policy self-alignment with fine-grained feedback
Reinforcement learning for precise optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-policy self-alignment technique
Fine-grained knowledge feedback
Token-level dense reward signals
Xueru Wen
School of Computer Science and Technology, University of Chinese Academy of Sciences
Natural Language Processing, Alignment, Large Language Models

Xinyu Lu
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China

Xinyan Guan
Institute of Software, Chinese Academy of Sciences

Yaojie Lu
Institute of Software, Chinese Academy of Sciences
Information Extraction, Large Language Models

Hongyu Lin
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China

Ben He
Professor, University of Chinese Academy of Sciences
Natural Language Processing, Information Retrieval

Xianpei Han
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China; State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China

Le Sun
Institute of Software, CAS
Information Retrieval, Natural Language Processing