🤖 AI Summary
This work addresses the poor calibration of post-trained large language models, which often exhibit systematic overconfidence; existing post-hoc calibration methods fail because they do not account for the inference-time dynamics that post-training introduces. The authors propose Dual-Align, a novel unsupervised post-hoc calibration framework that, for the first time, explicitly distinguishes and jointly mitigates two distinct sources of miscalibration: confidence drift and process drift. Dual-Align achieves this through a dual mechanism—confidence alignment via output-distribution matching and process alignment via re-stabilization of intermediate reasoning paths—while learning only a single temperature parameter. Extensive experiments demonstrate that Dual-Align significantly outperforms existing methods across multiple benchmarks, substantially reducing calibration error and approaching the performance of supervised oracle approaches, all while preserving the model's original task performance.
📝 Abstract
Post-training improves large language models (LLMs) but often worsens confidence calibration, leading to systematic overconfidence. Recent unsupervised post-hoc methods mitigate this by aligning the confidence of post-trained language models (PoLMs) to that of their well-calibrated pre-trained counterparts. However, framing calibration as static output-distribution matching overlooks the inference-time dynamics introduced by post-training. In particular, we show that calibration errors arise in two regimes: (i) confidence drift, where final confidence inflates while intermediate decision processes remain largely consistent, and (ii) process drift, where the intermediate inference pathways themselves diverge. Guided by this diagnosis, we propose Dual-Align, an unsupervised post-hoc framework for dual alignment in confidence calibration. Dual-Align performs confidence alignment, correcting confidence drift via final-distribution matching, and introduces process alignment, addressing process drift by locating the layer at which trajectories diverge and re-stabilizing the subsequent inference path. This dual strategy learns a single temperature parameter that corrects both drift types without sacrificing the performance gains of post-training. Experiments show consistent improvements over baselines, substantially reducing calibration error and approaching a supervised oracle.
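To make the confidence-alignment side of the abstract concrete, the following is a minimal sketch (not the authors' implementation) of unsupervised temperature fitting by output-distribution matching: a single temperature is chosen for the post-trained model's logits so that its softmax distribution best matches the pre-trained reference distribution, with no labels involved. The grid search, the KL objective, and all function names here are illustrative assumptions; the paper's actual objective also includes the process-alignment term, which is not reproduced.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Numerically stable temperature-scaled softmax.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_alignment_temperature(logits_post, logits_pre,
                                     grid=np.linspace(0.5, 5.0, 200)):
    """Illustrative stand-in for confidence alignment: pick the single
    temperature T minimizing the mean KL(p_pre || p_post(T)) between the
    pre-trained reference distribution and the temperature-scaled
    post-trained distribution. Unsupervised: no labels are used."""
    p_pre = softmax(logits_pre)
    best_T, best_kl = 1.0, np.inf
    for T in grid:
        p_post = softmax(logits_post, T)
        kl = np.mean(np.sum(
            p_pre * (np.log(p_pre + 1e-12) - np.log(p_post + 1e-12)),
            axis=-1))
        if kl < best_kl:
            best_T, best_kl = T, kl
    return best_T

# Toy example: the "post-trained" logits are a sharpened (overconfident)
# copy of the pre-trained logits, so the recovered temperature should be
# close to the sharpening factor of 2.
rng = np.random.default_rng(0)
logits_pre = rng.normal(size=(64, 10))
logits_post = 2.0 * logits_pre
T = confidence_alignment_temperature(logits_post, logits_pre)
print(T)
```

In this toy setup, dividing the sharpened logits by T ≈ 2 exactly recovers the reference distribution, which is why a simple 1-D search suffices; in practice a gradient-based fit of the same single parameter would be used over a validation set of model outputs.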