Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles

πŸ“… 2025-01-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study addresses inconsistent calibration in large language models (LLMs). We systematically investigate the joint impact of prompt style, model size, and inter-model response agreement on calibration performance. To this end, we propose Calib-n, a lightweight auxiliary calibration framework that aggregates responses from multiple LLMs to capture inter-model agreement and optimizes calibration with focal loss and an AUC surrogate loss alongside binary cross-entropy. Evaluated on four datasets spanning 12 mainstream LLMs and four prompt styles, Calib-n significantly reduces Expected Calibration Error (ECE), especially with few-shot prompts; its calibrated confidences are robust across accuracy variations and consistently outperform both the LLMs' internal probabilities and their verbalized confidences. Notably, we identify response agreement as a key factor in calibration, a finding that supports more trustworthy LLM deployment.
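The summary's headline metric is Expected Calibration Error (ECE). As a minimal sketch (not the paper's implementation), ECE can be computed by binning predictions by confidence and averaging the per-bin gap between accuracy and mean confidence, weighted by bin size; the equal-width binning below is the common convention, and `n_bins=10` is an assumed default.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE sketch: bin predictions by confidence into equal-width bins,
    then sum each bin's |accuracy - mean confidence| weighted by its share
    of the samples. Lower is better; 0 means perfectly calibrated."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue  # skip empty bins
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece
```

For example, ten predictions all at confidence 0.8 with 8 of them correct give an ECE of 0 (perfectly calibrated), while the same confidences with only 5 correct give an ECE of 0.3.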

πŸ“ Abstract
Calibration, the alignment between model confidence and prediction accuracy, is critical for the reliable deployment of large language models (LLMs). Existing works neglect to measure the generalization of their methods to other prompt styles and different sizes of LLMs. To address this, we define a controlled experimental setting covering 12 LLMs and four prompt styles. We additionally investigate if incorporating the response agreement of multiple LLMs and an appropriate loss function can improve calibration performance. Concretely, we build Calib-n, a novel framework that trains an auxiliary model for confidence estimation that aggregates responses from multiple LLMs to capture inter-model agreement. To optimize calibration, we integrate focal and AUC surrogate losses alongside binary cross-entropy. Experiments across four datasets demonstrate that both response agreement and focal loss improve calibration from baselines. We find that few-shot prompts are the most effective for auxiliary model-based methods, and auxiliary models demonstrate robust calibration performance across accuracy variations, outperforming LLMs' internal probabilities and verbalized confidences. These insights deepen the understanding of influence factors in LLM calibration, supporting their reliable deployment in diverse applications.
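The abstract names focal loss as one of the objectives integrated alongside binary cross-entropy. As a hedged sketch (the paper's exact formulation and hyperparameters are not given here), binary focal loss down-weights well-classified examples by a factor of (1 - p_t)^gamma, so training focuses on hard, miscalibrated ones; gamma = 0 recovers plain binary cross-entropy.

```python
import numpy as np

def binary_focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss sketch (Lin et al., 2017 form).
    p: predicted probabilities of the positive class; y: 0/1 labels.
    p_t is the probability assigned to the true class; the (1 - p_t)^gamma
    factor shrinks the loss on confident, correct predictions.
    gamma=2.0 is an assumed default, not the paper's setting."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    p_t = np.where(y == 1, p, 1.0 - p)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With gamma = 0 this reduces to binary cross-entropy; with gamma = 2, a confident correct prediction (p_t = 0.9) contributes only 1% of its BCE loss.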
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Calibration
Confidence-Accuracy Consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Calib-n
prompt engineering
model calibration
πŸ”Ž Similar Papers
No similar papers found.