A Quantitative Characterization of Forgetting in Post-Training

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a quantitative understanding of catastrophic forgetting in continual post-training, in particular the conditions under which the old mixture mass collapses (mass forgetting) and the old component drifts. The authors adopt a bimodal mixture abstraction that formally defines these two types of forgetting and analyze their dynamics under forward- and reverse-KL objectives using Gaussian mixture distributions. Leveraging the Bhattacharyya coefficient, importance weighting, and finite-batch theory, they provide the first quantitative characterization of forgetting severity, revealing the joint influence of KL direction, mode separability, behavioral overlap, and the sampling mechanism. Key contributions include proving that reverse KL avoids mass forgetting and incurs only exponentially decaying drift, clarifying the role of replay under different optimization objectives, and establishing a unified analytical framework covering SDFT, TTT-Discover, and OAPL, together with explicit conditions for preserving performance on old tasks.
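The mass-forgetting mechanism summarized above can be illustrated numerically. The following is a minimal sketch, not code from the paper; the component means, variance, and EM-style weight update are illustrative assumptions. It shows that a maximum-likelihood (forward-KL) fit of a two-mode Gaussian mixture, trained only on samples from the new mode, drives the old mixture weight toward zero:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Two well-separated, equal-variance modes: "old" task at 0, "new" task at 8.
MU_OLD, MU_NEW, SIGMA = 0.0, 8.0, 1.0

random.seed(0)
# Training data comes only from the new distribution (continual post-training).
data = [random.gauss(MU_NEW, SIGMA) for _ in range(1000)]

pi_old = 0.5  # initial old-mode mixture weight
for step in range(5):
    # E-step responsibilities of the old mode for each sample; the
    # likelihood-maximizing (forward-KL) update sets pi_old to their mean.
    resp = [
        pi_old * normal_pdf(x, MU_OLD, SIGMA)
        / (pi_old * normal_pdf(x, MU_OLD, SIGMA)
           + (1.0 - pi_old) * normal_pdf(x, MU_NEW, SIGMA))
        for x in data
    ]
    pi_old = sum(resp) / len(resp)
    print(f"step {step}: pi_old = {pi_old:.3e}")

# With well-separated modes, the old weight collapses essentially to zero.
```

Under reverse-KL objectives, by contrast, the paper argues the population optimum retains the old mass, which is why the divergence direction matters.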

📝 Abstract
Continual post-training of generative models is widely used, yet a principled understanding of when and why forgetting occurs remains limited. We develop theoretical results under a two-mode mixture abstraction (representing old and new tasks), proposed by Chen et al. (2025) (arXiv:2510.18874), and formalize forgetting in two forms: (i) mass forgetting, where the old mixture weight collapses to zero, and (ii) old-component drift, where an already-correct old component shifts during training. For equal-covariance Gaussian modes, we prove that forward-KL objectives trained on data from the new distribution drive the old weight to zero, while reverse-KL objectives converge to the true target (thereby avoiding mass forgetting) and perturb the old mean only through overlap-gated misassignment probabilities controlled by the Bhattacharyya coefficient, yielding drift that decays exponentially with mode separation and a locally well-conditioned geometry with exponential convergence. We further quantify how replay interacts with these objectives. For forward-KL, replay must modify the training distribution to change the population optimum; for reverse-KL, replay leaves the population objective unchanged but prevents finite-batch old-mode starvation through bounded importance weighting. Finally, we analyze three recently proposed near-on-policy post-training methods, SDFT (arxiv:2601.19897), TTT-Discover (arxiv:2601.16175), and OAPL (arxiv:2602.19362), via the same lens and derive explicit conditions under which each retains old mass and exhibits overlap-controlled drift. Overall, our results show that forgetting can be precisely quantified based on the interaction between divergence direction, geometric behavioral overlap, sampling regime, and the visibility of past behavior during training.
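The overlap gate in the abstract has a closed form in the simplest setting: for two 1-D Gaussians with equal variance sigma^2 and mean separation Delta, the Bhattacharyya coefficient is BC = exp(-Delta^2 / (8 sigma^2)). The sketch below (illustrative, not code from the paper) computes this quantity and shows the exponential decay of overlap, and hence of the drift bound, with mode separation:

```python
import math

def bhattacharyya_coefficient(mu1, mu2, sigma):
    """BC for two 1-D Gaussians with equal variance sigma^2.

    The general Bhattacharyya distance D_B reduces to
    (mu1 - mu2)^2 / (8 sigma^2) when the variances match,
    and BC = exp(-D_B).
    """
    delta = mu1 - mu2
    return math.exp(-delta * delta / (8.0 * sigma * sigma))

# Drift is overlap-gated: as the modes separate, BC (and with it the
# misassignment probability that bounds old-mean drift) decays exponentially.
for delta in (0.0, 2.0, 4.0, 8.0):
    bc = bhattacharyya_coefficient(0.0, delta, 1.0)
    print(f"separation {delta:.0f}: BC = {bc:.3e}")
```

At full overlap (Delta = 0) the coefficient is 1; at eight standard deviations of separation it is exp(-8), which is why well-separated old and new tasks incur only negligible drift under reverse-KL training.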
Problem

Research questions and friction points this paper is trying to address.

forgetting
post-training
generative models
continual learning
distribution shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

forgetting quantification
reverse-KL training
Bhattacharyya coefficient
post-training dynamics
mixture model abstraction