Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning

📅 2025-06-17

📈 Citations: 0

✨ Influential: 0

career value

185K/year

🤖 AI Summary

Fine-tuning large language models (LLMs) often degrades safety alignment—particularly the “admission of ignorance” capability—leading to increased hallucination. Method: This work first systematically uncovers the mechanism by which fine-tuning erodes ignorance awareness. It proposes a novel dual-path paradigm: (1) sparse parameter updates constrained to mitigate activation drift, and (2) entity perturbation combined with KL-divergence regularization to jointly disentangle entangled knowledge representations. Contribution/Results: The method preserves task performance while robustly maintaining ignorance awareness—achieving a 32.7% improvement in ignorance-expression accuracy and a 41.5% reduction in hallucination rate under multi-task fine-tuning, with zero task-performance degradation. Additionally, the paper introduces a quantitative evaluation framework for safety alignment capabilities, establishing a reproducible benchmark for future research in this domain.

Technology Category

Application Category

📝 Abstract

Existing work on mitigating catastrophic forgetting in large language model (LLM) fine-tuning has primarily focused on preserving specific data or tasks, while critically overlooking the degradation of essential capabilities instilled through safety alignment, particularly the model's ability to faithfully express ignorance. In this work, we show that this capability is significantly degraded during conventional fine-tuning, leading to undesired behaviors such as hallucinations. To address this novel but highly practical problem, we propose SEAT, a simple and effective fine-tuning approach that preserves both fine-tuning performance and the model's inherent ability to acknowledge its ignorance. SEAT integrates two key components: (1) sparse training that constrains activation drift, and (2) a novel entity perturbation method with KL-divergence regularization, designed to counter knowledge entanglement. Experimental results demonstrate that SEAT significantly outperforms baselines in preserving ignorance awareness while retaining fine-tuning performance, offering a more robust solution for LLM fine-tuning.

Problem

Research questions and friction points this paper is trying to address.

Preserves ignorance awareness in LLM fine-tuning

Prevents degradation of safety alignment capabilities

Reduces hallucinations during model fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse training to limit activation drift

Entity perturbation with KL-divergence regularization

Preserves ignorance awareness during fine-tuning

🔎 Similar Papers

MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning