The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation

📅 2026-04-18

🤖 AI Summary

On-policy distillation (OPD) improves task performance, but it often leaves models overconfident and miscalibrated, because supervision is formed with privileged information during training while the deployed model has only limited observations. This work shows that capability distillation and confidence calibration are decoupled, and proposes CaOPD, a calibration-aware OPD framework. CaOPD introduces a student-centric empirical confidence objective that replaces the teacher’s self-reported confidence with empirical confidence estimated from student rollouts, reaching a Pareto-optimal trade-off between calibration and task performance. Experiments show that CaOPD substantially improves calibration across diverse models and tasks while keeping accuracy competitive, and that it remains robust under out-of-distribution generalization and continual learning.
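
To make "overconfident and miscalibrated" concrete, here is a minimal sketch of expected calibration error (ECE), one common way to measure the gap between stated confidence and observed accuracy. The summary does not say which calibration metric the paper reports, so the choice of ECE, the function name `expected_calibration_error`, and the ten-bin default are all assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin.

    Illustrative helper only; not taken from the CaOPD code base.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * n_bins   # sum of confidences falling in each bin
    hit_sum = [0.0] * n_bins    # number of correct predictions in each bin
    count = [0] * n_bins        # number of predictions in each bin

    for c, ok in zip(confidences, correct):
        # Confidence 1.0 is clamped into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1

    ece = 0.0
    for i in range(n_bins):
        if count[i] == 0:
            continue
        avg_conf = conf_sum[i] / count[i]
        accuracy = hit_sum[i] / count[i]
        ece += (count[i] / n) * abs(avg_conf - accuracy)
    return ece
```

A model that reports 0.95 confidence on every answer but is right only half the time scores an ECE of about 0.45 under this definition, while one that reports 0.5 on the same answers scores 0.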

📝 Abstract

On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available during training, whereas the deployed model must report confidence using only deployment-time information. We formalize this perspective theoretically, showing that teacher-conditioned success is generally not a valid target for deployment-time confidence and that helpful privileged context induces entropy collapse and a systematic optimism bias. To address this, we propose a calibration-aware OPD framework, CaOPD, that estimates empirical confidence from model rollouts, replaces self-reported confidence with this student-grounded target, and distills the revised response through the same self-distillation pipeline. Experiments across various models and domains show that CaOPD achieves Pareto-optimal calibration while maintaining competitive capability, and that it generalizes robustly in out-of-distribution and continual-learning settings. Our findings highlight that capability distillation does not imply calibrated confidence, and that confidence should be treated as an essential objective in post-training. Code: https://github.com/SalesforceAIResearch/CaOPD
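
The first two steps the abstract describes, estimating empirical confidence from rollouts and swapping it in for the self-reported value, can be sketched roughly as follows. This is not the released CaOPD implementation: the `Confidence: <number>` response format, the regular expression, the function names, and the exact-match correctness check are hypothetical choices made for illustration, and the final distillation step is omitted.

```python
import re
from typing import Callable, Sequence

# Hypothetical response format: the model ends its answer with "Confidence: 0.95".
CONF_PATTERN = re.compile(r"Confidence:\s*([01](?:\.\d+)?)")


def empirical_confidence(
    rollouts: Sequence[str],
    is_correct: Callable[[str], bool],
) -> float:
    """Fraction of student rollouts judged correct (assumed estimator)."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    return sum(1 for r in rollouts if is_correct(r)) / len(rollouts)


def replace_confidence(response: str, target: float) -> str:
    """Overwrite the first self-reported confidence with the empirical target."""
    revised, n_subs = CONF_PATTERN.subn(
        f"Confidence: {target:.2f}", response, count=1
    )
    if n_subs == 0:
        raise ValueError("response contains no confidence statement")
    return revised


if __name__ == "__main__":
    # Four sampled student answers to the same prompt; the reference answer is "42".
    rollouts = ["42", "41", "42", "40"]
    target = empirical_confidence(rollouts, lambda answer: answer == "42")
    print(replace_confidence("Answer: 42\nConfidence: 0.95", target))
```

In this toy case the student is right on two of four rollouts, so the stated 0.95 is rewritten to 0.50. The revised response would then serve as the distillation target, which is the part of the pipeline this sketch leaves out.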

Problem

Research questions and friction points this paper is trying to address.

on-policy distillation

miscalibration

overconfidence

confidence calibration

privileged context

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation

calibration

overconfidence