Does Alignment Tuning Really Break LLMs' Internal Confidence?

πŸ“… 2024-08-31
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work identifies a fundamental tension between alignment fine-tuning and the calibration of large language models (LLMs): while alignment improves instruction-following ability, it systematically degrades confidence calibration. Method: We develop a multidimensional evaluation framework and conduct controlled cross-model and cross-task comparisons across model architectures, task types, calibration metrics, and confidence estimation methods (raw logits, softmax probabilities, and token probabilities). Contribution/Results: Under rigorously controlled conditions, we provide the first empirical evidence of this trade-off: alignment fine-tuning increases Expected Calibration Error (ECE) by 37–62%. These findings challenge the prevailing assumption that alignment and calibration are mutually compatible, call for a rethinking of how model confidence is evaluated, and establish theoretical and empirical foundations for algorithms that jointly optimize instruction following and calibration.
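
The summary's headline number is a gap in Expected Calibration Error (ECE). As a reference point, here is a minimal sketch of the standard equal-width-bin ECE computation; the n_bins=10 default and the toy inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correctness, n_bins=10):
    """Binned ECE: the weighted average, over equal-width confidence bins,
    of |mean accuracy - mean confidence| within each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            mask |= confidences == 0.0  # fold exact zeros into the first bin
        if mask.any():
            gap = abs(correctness[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # bin weight = fraction of samples in the bin
    return ece

# Toy example: an overconfident model (mean confidence 0.88, accuracy 0.60).
conf = np.array([0.95, 0.90, 0.85, 0.90, 0.80])
correct = np.array([1, 0, 1, 0, 1])
print(expected_calibration_error(conf, correct))
```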

πŸ“ Abstract
Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration. This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods. Initial analysis showed that the relationship between alignment and calibration is not always a trade-off, but under stricter analysis conditions, we found the alignment process consistently harms calibration. This highlights the need for (1) a careful approach when measuring model confidences and calibration errors and (2) future research into algorithms that can help LLMs to achieve both instruction-following and calibration without sacrificing either.
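
One of the abstract's four dimensions is the confidence extraction method. Below is a minimal sketch of the common token-probability approach, assuming a causal LM whose last logit rows are aligned with the answer tokens; the tensor shapes, the mean aggregation, and the random toy inputs are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def answer_confidence(logits, answer_ids):
    """Token-probability confidence: softmax the logits at the answer
    positions and average the probabilities of the emitted answer tokens.

    logits:     (seq_len, vocab_size) output of a causal LM forward pass
    answer_ids: (n,) token ids of the answer span, assumed aligned so that
                the last n logit rows predict these n tokens
    """
    probs = F.softmax(logits[-len(answer_ids):], dim=-1)
    token_probs = probs[torch.arange(len(answer_ids)), answer_ids]
    # Mean token probability; the product or the minimum are common alternatives.
    return token_probs.mean().item()

# Toy usage with random logits standing in for a real model's output.
logits = torch.randn(12, 32000)
answer_ids = torch.tensor([101, 2045, 7])
print(answer_confidence(logits, answer_ids))
```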
Problem

Research questions and friction points this paper is trying to address.

Does alignment tuning systematically degrade the calibration of large language models?
How does alignment affect model confidence across architectures, tasks, calibration metrics, and confidence extraction methods?
Can LLMs be trained to follow instructions well without sacrificing calibration?
Innovation

Methods, ideas, or system contributions that make the work stand out.

A four-dimensional evaluation framework covering models, tasks, calibration metrics, and confidence extraction methods
Controlled comparisons showing that, under strict analysis conditions, alignment consistently harms calibration
A call for algorithms that achieve instruction following and calibration jointly rather than trading one for the other