🤖 AI Summary
This work addresses alignment degradation in large language models during fine-tuning, for which existing theories lack a unified explanation linking parameter-space updates to alignment dynamics in function space. The authors propose a tractable alignment score together with a closed-form update rule, yielding a unified framework that connects parameter dynamics with alignment behavior. Within this framework, alignment evolution is decomposed into a competition between a “Rebound Force” and a “Driving Force.” Building on outcome-conditioned posteriors and an analysis of gradient dynamics, the theory predicts, and experiments confirm, a “Rehearsal Priming Effect,” in which prior alignment accelerates re-alignment upon re-exposure to aligned data. The predictions are validated across safety alignment, emergent misalignment, and sentiment settings, and controlled safety-alignment experiments confirm that rebound strength depends on the narrowness of the posterior distribution.
📝 Abstract
Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of the model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. We validate these predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.
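The two-force decomposition described above can be illustrated with a toy model. The sketch below is not the paper's closed form, which is not given here. It assumes a tabular softmax over four hypothetical completions, defines the alignment score as the log-odds of aligned versus non-aligned probability mass (an assumption), and takes one gradient step on the cross-entropy to a training distribution `q`. Under these assumptions, the first-order change in the score splits into a term that depends on `q` through the outcome-conditioned posteriors (labelled `driving` here) and a term that depends only on the current model distribution (labelled `rebound` here). All names and numbers are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def alignment_score(z, aligned):
    # Assumed score: log-odds of aligned vs. non-aligned probability mass.
    p = softmax(z)
    return np.log(p[aligned].sum()) - np.log(p[~aligned].sum())

def decompose_update(z, q, aligned, lr):
    """First-order change in the score after one cross-entropy gradient step."""
    p = softmax(z)
    # Posteriors over completions, conditioned on the outcome (aligned or not).
    post_a = np.where(aligned, p, 0.0) / p[aligned].sum()
    post_n = np.where(~aligned, p, 0.0) / p[~aligned].sum()
    contrast = post_a - post_n          # gradient of the score w.r.t. the logits
    driving = lr * contrast @ q         # depends on the training distribution
    rebound = -lr * contrast @ p        # depends only on the current model
    return driving, rebound

# Hypothetical setup: the model currently favours the aligned completions,
# while the training distribution q favours the non-aligned ones.
z = np.array([2.0, 0.5, -1.0, 0.0])
aligned = np.array([True, True, False, False])
q = np.array([0.1, 0.1, 0.4, 0.4])
lr = 1e-3

driving, rebound = decompose_update(z, q, aligned, lr)

# One gradient-descent step on the cross-entropy: grad wrt logits is (p - q).
z_new = z + lr * (q - softmax(z))
actual = alignment_score(z_new, aligned) - alignment_score(z, aligned)

print("driving + rebound:", driving + rebound)
print("actual change:    ", actual)
```

In this toy setting the `rebound` term equals `-lr` times the difference between `sum(p_i**2) / p(aligned)` over aligned completions and the same quantity over non-aligned ones, so it grows with both the current aligned mass and the concentration of the conditional distributions. This is consistent in spirit with the abstract's description of the Rebound Force, though the correspondence is only suggestive.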