🤖 AI Summary
This work addresses the poor calibration of post-trained large language models, which often exhibit systematic overconfidence; existing post-hoc calibration methods fail because they do not account for the inference-time dynamics that post-training introduces. The authors propose Dual-Align, a novel unsupervised post-hoc calibration framework that, for the first time, explicitly distinguishes and jointly mitigates two distinct sources of miscalibration: confidence drift and process drift. Dual-Align achieves this through a dual mechanism—confidence alignment via output-distribution matching and process alignment via re-stabilization of intermediate reasoning paths—while learning only a single temperature parameter. Extensive experiments demonstrate that Dual-Align significantly outperforms existing methods across multiple benchmarks, substantially reducing calibration error and approaching the performance of supervised oracle approaches, all while preserving the model's original task performance.
📝 Abstract
Post-training improves large language models (LLMs) but often worsens confidence calibration, leading to systematic overconfidence. Recent unsupervised post-hoc methods mitigate this by aligning the confidence of post-trained language models (PoLMs) to that of their well-calibrated pre-trained counterparts. However, framing calibration as static output-distribution matching overlooks the inference-time dynamics introduced by post-training. In particular, we show that calibration errors arise in two regimes: (i) confidence drift, where final confidence inflates while intermediate decision processes remain largely consistent, and (ii) process drift, where the intermediate inference pathways themselves diverge. Guided by this diagnosis, we propose Dual-Align, an unsupervised post-hoc framework for dual alignment in confidence calibration. Dual-Align performs confidence alignment, correcting confidence drift via final-distribution matching, and introduces process alignment, addressing process drift by locating the layer at which trajectories diverge and re-stabilizing the subsequent inference path. This dual strategy learns a single temperature parameter that corrects both drift types without sacrificing the performance gains of post-training. Experiments show consistent improvements over baselines, substantially reducing calibration error and approaching a supervised oracle.
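To make the confidence-alignment side of the abstract concrete, the following is a minimal sketch (not the authors' implementation) of unsupervised temperature fitting by output-distribution matching: a single temperature is chosen for the post-trained model's logits so that its softmax distribution best matches the pre-trained reference distribution, with no labels involved. The grid search, the KL objective, and all function names here are illustrative assumptions; the paper's actual objective also includes the process-alignment term, which is not reproduced.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Numerically stable temperature-scaled softmax.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_alignment_temperature(logits_post, logits_pre,
                                     grid=np.linspace(0.5, 5.0, 200)):
    """Illustrative stand-in for confidence alignment: pick the single
    temperature T minimizing the mean KL(p_pre || p_post(T)) between the
    pre-trained reference distribution and the temperature-scaled
    post-trained distribution. Unsupervised: no labels are used."""
    p_pre = softmax(logits_pre)
    best_T, best_kl = 1.0, np.inf
    for T in grid:
        p_post = softmax(logits_post, T)
        kl = np.mean(np.sum(
            p_pre * (np.log(p_pre + 1e-12) - np.log(p_post + 1e-12)),
            axis=-1))
        if kl < best_kl:
            best_T, best_kl = T, kl
    return best_T

# Toy example: the "post-trained" logits are a sharpened (overconfident)
# copy of the pre-trained logits, so the recovered temperature should be
# close to the sharpening factor of 2.
rng = np.random.default_rng(0)
logits_pre = rng.normal(size=(64, 10))
logits_post = 2.0 * logits_pre
T = confidence_alignment_temperature(logits_post, logits_pre)
print(T)
```

In this toy setup, dividing the sharpened logits by T ≈ 2 exactly recovers the reference distribution, which is why a simple 1-D search suffices; in practice a gradient-based fit of the same single parameter would be used over a validation set of model outputs.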