Continual Calibration: Coverage Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning

📅 2026-04-26

🤖 AI Summary

This work addresses an overlooked issue in continual fine-tuning of large language models: conformal coverage, a measure of how reliable the model's uncertainty estimates are, can degrade earlier and more sharply than accuracy. The study characterizes this effect across three model families and eight task sequences, and introduces calibration replay, a lightweight post-hoc procedure that refits a task-specific conformal threshold on a small held-out buffer after each model update. With a buffer of 200 samples per task, it typically restores coverage to within two points of the nominal level. The procedure adds no training-time gradient cost and uses less than 1% of the memory of ordinary experience replay. The theoretical analysis gives a finite-sample conformal validity guarantee under exchangeability, and the experiments show that, in classification-style settings, coverage loss is on average roughly 3.4 times larger than accuracy loss.

📝 Abstract

Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more sharply than top-1 performance. We study this empirically by measuring conformal coverage and calibration error on sequentially fine-tuned models across three model families and eight task sequences drawn primarily from classification and multiple-choice benchmarks. Across the classification-style settings we study, coverage loss exceeds accuracy loss by a factor of roughly \(3.4\times \pm 0.5\times\) on average across seeds; in the most pronounced case, coverage drops from \(0.92\) to \(0.61\), while accuracy remains within three points of baseline. Standard continual-learning methods that preserve accuracy do not automatically preserve coverage, and naive calibration baselines recover only part of the gap. We propose calibration replay, a lightweight post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update. It adds no training-time gradient cost, uses less than one percent of the memory of ordinary experience replay, and typically restores coverage to within two points of nominal at buffer size \(m = 200\). We accompany the empirical study with a drift decomposition, a finite-sample recovery theorem showing exact conformal validity under exchangeability, and a mixture-validity proposition explaining why pooled thresholds do not suffice. Our guarantees are stated for classification-style tasks with task-specific buffers; extensions to open-ended generation are exploratory.
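The refit step described in the abstract can be illustrated with a minimal split-conformal sketch. This is not the authors' implementation: the class name `CalibrationReplay`, the `predict_proba` callable, and the nonconformity score (one minus the probability assigned to the true label) are all assumptions made for illustration. The only part taken from standard conformal prediction is the finite-sample quantile index \(\lceil (m+1)(1-\alpha) \rceil\).

```python
import math
import numpy as np


def conformal_threshold(scores, alpha):
    """Split-conformal threshold: the ceil((m+1)(1-alpha))-th smallest score.

    Returns +inf when the buffer is too small for the requested level.
    """
    scores = np.sort(np.asarray(scores, dtype=float))
    m = scores.size
    k = math.ceil((m + 1) * (1 - alpha))
    if k > m:
        return float("inf")
    return float(scores[k - 1])


def nonconformity(probs, labels):
    """Assumed score: 1 minus the probability given to the true label."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    return 1.0 - probs[np.arange(labels.size), labels]


class CalibrationReplay:
    """Hypothetical helper: one held-out buffer and one threshold per task."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.buffers = {}     # task_id -> (inputs, labels)
        self.thresholds = {}  # task_id -> conformal threshold

    def add_task(self, task_id, inputs, labels):
        # Store held-out examples for this task (e.g. m = 200 of them).
        self.buffers[task_id] = (inputs, np.asarray(labels, dtype=int))

    def refit(self, predict_proba):
        # Re-score every buffer under the *current* model, then recompute
        # each task's threshold separately (no pooling across tasks).
        for task_id, (inputs, labels) in self.buffers.items():
            scores = nonconformity(predict_proba(inputs), labels)
            self.thresholds[task_id] = conformal_threshold(scores, self.alpha)

    def prediction_set(self, task_id, probs):
        # Keep every class whose score does not exceed the task threshold.
        q = self.thresholds[task_id]
        return np.flatnonzero(1.0 - np.asarray(probs, dtype=float) <= q)
```

In this sketch, `refit` would be called once after each fine-tuning update, passing a function that returns class probabilities from the updated model. No gradients are computed, and the only stored state is the per-task buffer, which is consistent with the low-overhead description above. Thresholds are kept per task rather than pooled, mirroring the abstract's remark that pooled thresholds do not suffice.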

Problem

Research questions and friction points this paper is trying to address.

continual learning

large language models

calibration

conformal coverage

uncertainty reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

continual calibration

conformal coverage

calibration replay