Why Fine-Tuning Encourages Hallucinations and How to Fix It

📅 2026-04-16

🤖 AI Summary
This work addresses factual hallucination induced by supervised fine-tuning (SFT), which often arises when training on new factual information degrades the knowledge a pretrained language model already holds. Viewing the problem through the lens of continual learning, the study identifies interference among overlapping semantic representations as a main driver of these hallucinations. To mitigate it, the authors propose a self-distillation-based fine-tuning method that regularizes drift in the model's output distribution while still allowing new facts to be learned. Separately, for settings where no new knowledge needs to be acquired, they show that freezing groups of parameters suppresses factual plasticity. Both approaches aim to preserve task performance while reducing the hallucinations introduced by standard fine-tuning.
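The idea of regularizing output-distribution drift can be illustrated with a small sketch. This is not the authors' implementation: the loss form (cross-entropy plus a weighted KL term against a frozen copy of the pre-fine-tuning model), the KL direction, and the names `self_distillation_loss` and `lam` are all assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    """Standard SFT loss for one token: -log p_student(target)."""
    return -math.log(softmax(student_logits)[target_index])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def self_distillation_loss(student_logits, teacher_logits, target_index, lam=0.5):
    """Hypothetical combined objective: task loss + lam * drift penalty.

    teacher_logits are assumed to come from a frozen copy of the model
    taken before fine-tuning; lam is an assumed trade-off weight.
    """
    task = cross_entropy(student_logits, target_index)
    drift = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    return task + lam * drift
```

When the fine-tuned model's logits match the frozen copy's, the drift term vanishes and the objective reduces to plain cross-entropy; the penalty grows as the two distributions diverge.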

📝 Abstract
Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations with respect to knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since these hallucinations arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations with respect to pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
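Freezing parameter groups can be sketched as an update step that skips any parameter whose name falls in a frozen group. The abstract does not say which groups are frozen, so the prefix-based selection, the `sgd_step` helper, and the example group names below are purely hypothetical.

```python
def sgd_step(params, grads, frozen_prefixes, lr=0.1):
    """Return updated parameters, leaving frozen groups untouched.

    params and grads map parameter names to floats (stand-ins for tensors).
    frozen_prefixes lists name prefixes of the groups to freeze; which
    groups to freeze is an assumption, not taken from the paper.
    """
    prefixes = tuple(frozen_prefixes)
    updated = {}
    for name, value in params.items():
        if name.startswith(prefixes):
            updated[name] = value  # frozen: no gradient update applied
        else:
            updated[name] = value - lr * grads[name]
    return updated


# Usage with hypothetical parameter names: freeze the "mlp." group only.
params = {"embed.weight": 1.0, "attn.0.weight": 3.0, "mlp.0.weight": 2.0}
grads = {name: 1.0 for name in params}
updated = sgd_step(params, grads, frozen_prefixes=["mlp."], lr=0.5)
```

In a framework such as PyTorch, the same effect is usually obtained by setting `requires_grad = False` on the selected parameters so the optimizer never updates them.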
Problem

Research questions and friction points this paper is trying to address.

hallucination
supervised fine-tuning
knowledge degradation
large language models
factual consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-distillation
supervised fine-tuning
hallucination mitigation
continual learning
output-distribution regularization