🤖 AI Summary
Instruction fine-tuning (IFT) enhances the utility of large language models (LLMs) but often degrades factual consistency, primarily because tuned models over-rely on long-tail knowledge insufficiently covered during pretraining, thereby increasing hallucination. This work presents the first systematic analysis of the utility–fidelity trade-off inherent in IFT and proposes UNIT, an uncertainty-aware instruction fine-tuning paradigm. UNIT jointly models response generation and uncertainty estimation, incorporating explicit uncertainty-reflection tokens and confidence-driven response-suffix injection. Evaluated under a multi-dimensional fidelity assessment protocol across multiple benchmarks, UNIT reduces hallucination rates by 38.2% on average while preserving or improving task completion rates and user preference scores. It thus enhances utility and factual fidelity together, resolving the longstanding tension between helpfulness and truthfulness in LLM alignment.
📝 Abstract
Instruction Fine-tuning (IFT) can enhance the helpfulness of Large Language Models (LLMs), but it may lower their truthfulness. This trade-off arises because IFT steers LLMs to generate responses with long-tail knowledge that is not well covered during pre-training, leading to more informative but less truthful answers when generalizing to unseen tasks. In this paper, we empirically demonstrate this helpfulness-truthfulness trade-off in IFT and propose $\textbf{UNIT}$, a novel IFT paradigm to address it. UNIT teaches LLMs to recognize their uncertainty and explicitly reflect it at the end of their responses. Experimental results show that UNIT-tuned models maintain their helpfulness while distinguishing between certain and uncertain claims, thereby reducing hallucinations.
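To make the idea concrete, here is a minimal sketch of how a UNIT-style training example might be constructed: the target response is followed by an explicit uncertainty-reflection suffix listing the claims the model is unsure about. The function name, template wording, and claim-selection step are illustrative assumptions, not the paper's exact format; in practice the uncertain claims would come from a confidence signal (e.g. token probabilities or self-consistency sampling) during data construction.

```python
# Hypothetical sketch of a UNIT-style training target: a response with an
# appended uncertainty-reflection suffix. Template text and function names
# are assumptions for illustration, not the paper's exact format.

def build_unit_target(response: str, uncertain_claims: list[str]) -> str:
    """Append an uncertainty reflection to a response.

    `uncertain_claims` is assumed to be produced upstream by some
    confidence estimator (e.g. self-consistency sampling) when the
    fine-tuning data is built.
    """
    if uncertain_claims:
        reflection = (
            "\n\nI am uncertain about the following claims:\n"
            + "\n".join(f"- {claim}" for claim in uncertain_claims)
        )
    else:
        reflection = "\n\nI am confident in all claims above."
    return response + reflection


target = build_unit_target(
    "Alexander Fleming discovered penicillin in 1928.",
    ["the exact year of the discovery"],
)
print(target)
```

Fine-tuning on targets shaped like this teaches the model to emit the reflection suffix itself at inference time, so downstream users can separate certain from uncertain claims instead of treating the whole response as equally reliable.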