Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
📄 PDF
🤖 AI Summary
Preference alignment methods such as Direct Preference Optimization (DPO) can inadvertently exacerbate hallucinations in large language models by favoring fluent and confident—but potentially inaccurate—responses. To address this, the authors propose Factuality-aware DPO (F-DPO), which extends standard DPO by using binary factuality labels to flip misordered preference pairs and by adding a factuality-aware margin that amplifies learning from pairs with pronounced factual discrepancies. Requiring only binary labels—and no auxiliary reward models, token-level annotations, or multi-stage training—F-DPO significantly simplifies the alignment pipeline while improving generalization. Evaluated across seven open-source models ranging from 1B to 14B parameters, F-DPO reduces hallucination rates up to fivefold (e.g., on Qwen3-8B) and boosts factuality scores by 50%. On TruthfulQA, it improves MC1 and MC2 accuracy by 17% and 49%, respectively.

📝 Abstract
Preference alignment methods such as RLHF and Direct Preference Optimization (DPO) improve instruction following, but they can also reinforce hallucinations when preference judgments reward fluency and confidence over factual correctness. We introduce F-DPO (Factuality-aware Direct Preference Optimization), a simple extension of DPO that uses only binary factuality labels. F-DPO (i) applies a label-flipping transformation that corrects misordered preference pairs so the chosen response is never less factual than the rejected one, and (ii) adds a factuality-aware margin that emphasizes pairs with clear correctness differences, while reducing to standard DPO when both responses share the same factuality. We construct factuality-aware preference data by augmenting DPO pairs with binary factuality indicators and synthetic hallucinated variants. Across seven open-weight LLMs (1B-14B), F-DPO consistently improves factuality and reduces hallucination rates relative to both base models and standard DPO. On Qwen3-8B, F-DPO reduces hallucination rates fivefold (from 0.424 to 0.084) while improving factuality scores by 50% (from 5.26 to 7.90). F-DPO also generalizes to out-of-distribution benchmarks: on TruthfulQA, Qwen2.5-14B achieves +17% MC1 accuracy (0.500 to 0.585) and +49% MC2 accuracy (0.357 to 0.531). F-DPO requires no auxiliary reward model, token-level annotations, or multi-stage training.
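The two ingredients the abstract describes — label flipping and a factuality-aware margin on top of the standard DPO objective — can be sketched as a per-pair loss. This is a minimal illustrative sketch, not the paper's actual implementation: the function name `f_dpo_loss`, the hyperparameters `beta` and `gamma`, and the exact margin form `gamma * |f_w - f_l|` are assumptions chosen so that the loss reduces to vanilla DPO when both responses share the same factuality label.

```python
import math


def f_dpo_loss(logr_chosen, logr_rejected, fact_chosen, fact_rejected,
               beta=0.1, gamma=1.0):
    """Hypothetical sketch of an F-DPO-style pairwise loss.

    logr_*:  log-probability ratios log pi(y|x) - log pi_ref(y|x)
             for the chosen / rejected response.
    fact_*:  binary factuality labels (1 = factual, 0 = hallucinated).
    beta:    standard DPO inverse-temperature.
    gamma:   assumed margin scale (not from the paper).
    """
    # (i) Label flipping: if the "chosen" response is less factual than
    # the "rejected" one, swap the pair so the chosen response is never
    # less factual than the rejected one.
    if fact_chosen < fact_rejected:
        logr_chosen, logr_rejected = logr_rejected, logr_chosen
        fact_chosen, fact_rejected = fact_rejected, fact_chosen

    # (ii) Factuality-aware margin: demand a larger reward gap when the
    # pair has a clear correctness difference; when both labels agree
    # the margin vanishes and the loss reduces to standard DPO.
    margin = gamma * abs(fact_chosen - fact_rejected)
    logits = beta * (logr_chosen - logr_rejected) - margin

    # -log(sigmoid(logits)) = log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))
```

With equal factuality labels the margin term is zero regardless of `gamma`, matching the abstract's claim that F-DPO reduces to standard DPO in that case; a mixed-factuality pair incurs a larger loss until the policy separates the two responses by at least the margin.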
Problem

Research questions and friction points this paper is trying to address.

hallucination
factuality
preference alignment
large language models
truthfulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Factuality-aware Learning
Hallucination Reduction
Direct Preference Optimization
Preference Alignment
Binary Factuality Labels
Sindhuja Chaduvula
Vector Institute for Artificial Intelligence, Toronto, Canada
Ahmed Y. Radwan
Vector Institute for Artificial Intelligence, Toronto, Canada
Azib Farooq
Independent Researcher, OH, USA
Yani Ioannou
Assistant Professor, Schulich Research Chair, University of Calgary
Deep Learning, Efficient Deep Learning, Sparse Neural Networks, Trustworthy AI
Shaina Raza
Vector Institute for Artificial Intelligence, Toronto, Canada