Why Fine-Tuning Encourages Hallucinations and How to Fix It

📅 2026-04-16

🤖 AI Summary

This work addresses factual hallucination induced by supervised fine-tuning (SFT), which arises when newly introduced knowledge degrades what a pretrained language model already knows. Treating the problem as one of continual learning, the study identifies localized interference among overlapping semantic representations as a main driver of these hallucinations. To mitigate it, the authors propose a self-distillation-based SFT method that regularizes drift in the model's output distribution while new facts are learned. Separately, for settings where no new knowledge needs to be acquired, they show that freezing groups of parameters suppresses factual plasticity, preserving task performance while reducing hallucinations.
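
To make the idea of constraining output-distribution shift concrete, here is a minimal sketch of one possible loss, not the paper's actual objective. It assumes the "teacher" is a frozen copy of the model from before fine-tuning, that drift is measured as KL(reference || current) over next-token distributions, and that a weight `beta` balances the two terms. All three choices, and every function name, are illustrative assumptions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, cur_logits):
    """KL(ref || cur) between the softmax distributions of two logit vectors."""
    ref_lp = log_softmax(ref_logits)
    cur_lp = log_softmax(cur_logits)
    return sum(math.exp(r) * (r - c) for r, c in zip(ref_lp, cur_lp))

def self_distill_loss(cur_logits, ref_logits, target, beta=1.0):
    """Cross-entropy on the SFT target plus a penalty on output drift.

    cur_logits: logits from the model being fine-tuned.
    ref_logits: logits from the frozen pre-fine-tuning copy (assumed teacher).
    beta: hypothetical regularization weight, not a value from the paper.
    """
    ce = -log_softmax(cur_logits)[target]
    drift = kl_divergence(ref_logits, cur_logits)
    return ce + beta * drift

# Same SFT target, but the second call also pays for drifting from the reference.
no_drift = self_distill_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], target=0)
with_drift = self_distill_loss([2.0, 0.0, 0.0], [0.0, 2.0, 0.0], target=0)
```

With `beta=0` the loss reduces to ordinary SFT cross-entropy. Larger values trade new-fact fitting against staying close to the pretrained output distribution.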

📝 Abstract

Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
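
The parameter-freezing idea in the abstract can be illustrated with a toy update rule. This is a sketch under stated assumptions: the abstract does not say which parameter groups are frozen, so the name prefixes below are hypothetical, and scalar parameters with plain SGD stand in for a real model and optimizer.

```python
def sgd_step(params, grads, frozen_prefixes, lr=0.1):
    """Apply one SGD step, leaving frozen parameter groups untouched.

    params, grads: dicts mapping parameter names to floats (toy stand-ins
    for tensors). frozen_prefixes: name prefixes selecting the groups to
    freeze; which groups to pick is an assumption, not taken from the paper.
    """
    frozen = tuple(frozen_prefixes)
    updated = {}
    for name, value in params.items():
        if name.startswith(frozen):
            updated[name] = value  # frozen: keep the pretrained value
        else:
            updated[name] = value - lr * grads[name]
    return updated

# Hypothetical parameter names; only the attention weight is allowed to move.
params = {"layer0.mlp.w": 1.0, "layer0.attn.w": 1.0}
grads = {"layer0.mlp.w": 0.5, "layer0.attn.w": 0.5}
new_params = sgd_step(params, grads, frozen_prefixes=["layer0.mlp."])
```

In PyTorch the analogous move is to set `requires_grad = False` on the selected parameters before building the optimizer, so they receive no gradient updates.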

Problem

Research questions and friction points this paper is trying to address.

hallucination

supervised fine-tuning

knowledge degradation

large language models

factual consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-distillation

supervised fine-tuning

hallucination mitigation