Does Alignment Tuning Really Break LLMs' Internal Confidence?

πŸ“… 2024-08-31
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work identifies a fundamental tension between alignment fine-tuning and the calibration of large language models (LLMs): while alignment improves instruction-following ability, it systematically degrades confidence calibration. Method: We develop a multidimensional evaluation framework and conduct controlled cross-model and cross-task comparisons across model architectures, task types, calibration metrics, and confidence estimation methods (raw logits, softmax probabilities, and token probabilities). Contribution/Results: Under rigorously controlled conditions, we provide the first empirical evidence of this trade-off: alignment fine-tuning increases Expected Calibration Error (ECE) by 37–62%. These findings challenge the prevailing assumption that alignment and calibration are mutually compatible, call for a rethinking of how model confidence is evaluated, and establish theoretical and empirical foundations for algorithms that jointly optimize instruction following and calibration.
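
The summary's headline number is a gap in Expected Calibration Error (ECE). As a reference point, here is a minimal sketch of the standard equal-width-bin ECE computation; the n_bins=10 default and the toy inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correctness, n_bins=10):
    """Binned ECE: the weighted average, over equal-width confidence bins,
    of |mean accuracy - mean confidence| within each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            mask |= confidences == 0.0  # fold exact zeros into the first bin
        if mask.any():
            gap = abs(correctness[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # bin weight = fraction of samples in the bin
    return ece

# Toy example: an overconfident model (mean confidence 0.88, accuracy 0.60).
conf = np.array([0.95, 0.90, 0.85, 0.90, 0.80])
correct = np.array([1, 0, 1, 0, 1])
print(expected_calibration_error(conf, correct))
```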

πŸ“ Abstract
Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration. This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods. Initial analysis showed that the relationship between alignment and calibration is not always a trade-off, but under stricter analysis conditions, we found the alignment process consistently harms calibration. This highlights the need for (1) a careful approach when measuring model confidences and calibration errors and (2) future research into algorithms that can help LLMs to achieve both instruction-following and calibration without sacrificing either.
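
One of the abstract's four dimensions is the confidence extraction method. Below is a minimal sketch of the common token-probability approach, assuming a causal LM whose last logit rows are aligned with the answer tokens; the tensor shapes, the mean aggregation, and the random toy inputs are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def answer_confidence(logits, answer_ids):
    """Token-probability confidence: softmax the logits at the answer
    positions and average the probabilities of the emitted answer tokens.

    logits:     (seq_len, vocab_size) output of a causal LM forward pass
    answer_ids: (n,) token ids of the answer span, assumed aligned so that
                the last n logit rows predict these n tokens
    """
    probs = F.softmax(logits[-len(answer_ids):], dim=-1)
    token_probs = probs[torch.arange(len(answer_ids)), answer_ids]
    # Mean token probability; the product or the minimum are common alternatives.
    return token_probs.mean().item()

# Toy usage with random logits standing in for a real model's output.
logits = torch.randn(12, 32000)
answer_ids = torch.tensor([101, 2045, 7])
print(answer_confidence(logits, answer_ids))
```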
Problem

Research questions and friction points this paper is trying to address.

Does alignment tuning systematically degrade the calibration of large language models?
How does alignment affect model confidence across architectures, tasks, calibration metrics, and confidence extraction methods?
Can LLMs be trained to follow instructions well without sacrificing calibration?
Innovation

Methods, ideas, or system contributions that make the work stand out.

A four-dimensional evaluation framework covering models, tasks, calibration metrics, and confidence extraction methods
Controlled comparisons showing that, under strict analysis conditions, alignment consistently harms calibration
A call for algorithms that achieve instruction following and calibration jointly rather than trading one for the other