Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Supervised fine-tuning (SFT) significantly impairs large language models' (LLMs') honesty, i.e., their ability to accurately acknowledge knowledge boundaries, yet it does not erode their underlying capacity to recognize unknowns; instead, it suppresses the faithful *expression* of this awareness. This work proposes an efficient, low-data honesty restoration method: Hessian-guided identification of honesty-critical neurons, rollback of those neurons to their pretrained state, and task-aware co-optimization of the remaining parameters. It is the first to empirically uncover the "preserved recognition, suppressed expression" mechanism and demonstrates that honest output can be restored by tuning only a small subset of critical parameters. Evaluated across four QA benchmarks and five major LLM families, the method recovers an average of 33.25% of the compromised honesty, runs at least 2.23× faster than baseline recovery methods, and requires over an order of magnitude less data.

Technology Category

Application Category

📝 Abstract
The honesty of Large Language Models (LLMs) is increasingly important for safe deployment in high-stakes domains. However, this crucial trait is severely undermined by supervised fine-tuning (SFT), a common technique for model specialization. Existing recovery methods rely on data-intensive global parameter adjustments, implicitly assuming that SFT deeply corrupts the models' ability to recognize their knowledge boundaries. However, we observe that fine-tuned LLMs still preserve this ability; what is damaged is their capacity to faithfully express that awareness. Building on this, we propose Honesty-Critical Neurons Restoration (HCNR) to surgically repair this suppressed capacity. HCNR identifies and restores key expression-governing neurons to their pre-trained state while harmonizing them with task-oriented neurons via Hessian-guided compensation. Experiments on four QA tasks and five LLM families demonstrate that HCNR effectively recovers 33.25% of the compromised honesty while achieving at least 2.23x speedup with over 10x less data compared to baseline methods, offering a practical solution for trustworthy LLM deployment.
Problem

Research questions and friction points this paper is trying to address.

Recovering honesty in fine-tuned LLMs damaged by supervised fine-tuning
Restoring models' capacity to express knowledge boundary awareness
Replacing data- and parameter-intensive global recovery methods with surgical neuron repair
Innovation

Methods, ideas, or system contributions that make the work stand out.

Surgically restores honesty-critical neurons to pre-trained state
Harmonizes neurons via Hessian-guided compensation mechanism
Achieves parameter-efficient honesty recovery with minimal data
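The restoration steps above can be sketched in a few lines. This is an illustrative NumPy toy, not the authors' released implementation: the function name, the diagonal-Hessian approximation, and the saliency formula (an optimal-brain-surgeon-style score, H_ii · Δw²) are assumptions for exposition; the actual HCNR criterion and its Hessian-guided compensation of task-oriented neurons are more involved.

```python
import numpy as np

def restore_honesty_neurons(w_pre, w_ft, h_diag, k):
    """Toy sketch of Hessian-guided rollback (HCNR-style).

    w_pre  : pretrained weights (1-D array)
    w_ft   : fine-tuned weights after SFT
    h_diag : diagonal Hessian approximation of an honesty loss at w_ft
    k      : number of parameters to roll back
    """
    # Score each parameter's SFT-induced shift by a second-order
    # saliency, H_ii * (delta w)^2, so large, curvature-sensitive
    # shifts count as honesty-critical.
    delta = w_ft - w_pre
    saliency = h_diag * delta ** 2
    # Roll back the k most honesty-critical parameters to their
    # pretrained values; all other (task-oriented) parameters keep
    # their fine-tuned values.
    idx = np.argsort(saliency)[-k:]
    w_restored = w_ft.copy()
    w_restored[idx] = w_pre[idx]
    return w_restored, idx

# Example: parameters 1 and 3 have the largest Hessian-weighted
# shifts, so only they are restored to the pretrained state.
w_pre = np.zeros(5)
w_ft = np.ones(5)
h_diag = np.array([1.0, 5.0, 2.0, 4.0, 3.0])
w_restored, idx = restore_honesty_neurons(w_pre, w_ft, h_diag, k=2)
```

In the paper's full method, the rollback is followed by a compensation step that adjusts the untouched task-oriented neurons so task performance is preserved; the sketch omits that step.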
Zeyu Shi
SKLCCSE, School of Computer Science and Engineering, Beihang University
Ziming Wang
SKLCCSE, School of Computer Science and Engineering, Beihang University
Tianyu Chen
SKLCCSE, School of Computer Science and Engineering, Beihang University
Shiqi Gao
Beihang University
Haoyi Zhou
Associate Professor, Beihang University
Machine Learning · Data Mining · Time-series
Qingyun Sun
Assistant Professor, Beihang University
Data Mining · Graph Machine Learning · Deep Learning
Jianxin Li
SKLCCSE, School of Computer Science and Engineering, Beihang University; Zhongguancun Laboratory, Beijing