Alignment Dynamics in LLM Fine-Tuning

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses alignment degradation in large language models during fine-tuning, for which existing explanations lack a unified account linking parameter-space learning dynamics to function-space alignment behavior. The authors introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Within this framework, each alignment update decomposes into two competing components, a Rebound Force and a Driving Force. The framework predicts a Rehearsal Priming Effect, in which prior alignment accelerates re-alignment upon re-exposure, and experiments confirm it across safety alignment, emergent misalignment, and sentiment settings. Controlled experiments in safety alignment further confirm that rebound strength depends on posterior narrowness.

📝 Abstract

Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. We validate these predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.
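
The abstract does not state the alignment score or its update rule, so the following is only a toy illustration of the kind of decomposition it describes, not the paper's derivation. It assumes a softmax model over a small finite set of completions whose logits are updated directly by one cross-entropy gradient step, and it defines a hypothetical score as the log-odds of the probability mass on aligned completions. Under those assumptions the first-order change in the score splits into a term that depends on the training distribution `q` (labeled `driving` here) and a term that depends only on the model's own distribution (labeled `rebound` here). All function names and the example numbers are illustrative.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def alignment_score(z, aligned):
    """Hypothetical score: log-odds of the mass on aligned completions."""
    p = softmax(z)
    p_aligned = sum(p[i] for i in aligned)
    return math.log(p_aligned) - math.log(1.0 - p_aligned)


def forces(z, q, aligned):
    """First-order rate of change of the score under z <- z + lr * (q - p).

    Returns (driving, rebound); their sum is d(score)/d(lr) at lr = 0.
    """
    p = softmax(z)
    n = len(z)
    p_aligned = sum(p[i] for i in aligned)
    p_other = 1.0 - p_aligned
    # +1 for aligned completions, -1 for non-aligned ones.
    sign = [1.0 if i in aligned else -1.0 for i in range(n)]
    # Posterior of each completion conditioned on its own outcome group.
    post = [p[i] / (p_aligned if i in aligned else p_other) for i in range(n)]
    # Overlap of the training distribution with the two posteriors.
    driving = sum(sign[i] * post[i] * q[i] for i in range(n))
    # Depends only on the model: current mass times posterior concentration.
    rebound = -sum(sign[i] * post[i] * p[i] for i in range(n))
    return driving, rebound


def gradient_step(z, q, lr):
    """One cross-entropy gradient step on the logits toward target q."""
    p = softmax(z)
    return [z[i] + lr * (q[i] - p[i]) for i in range(len(z))]


# Toy setup: four completions, the first two count as aligned.
logits = [2.0, 1.0, 0.0, -1.0]
aligned_set = {0, 1}
q_aligned = [0.5, 0.5, 0.0, 0.0]      # training data made of aligned completions
q_non_aligned = [0.0, 0.0, 0.5, 0.5]  # training data made of non-aligned completions

for name, q in [("aligned data", q_aligned), ("non-aligned data", q_non_aligned)]:
    driving, rebound = forces(logits, q, aligned_set)
    print(f"{name}: driving={driving:+.4f} rebound={rebound:+.4f}")
```

In this toy, the `rebound` term is negative whenever the model is already mostly aligned and its aligned-side posterior is concentrated, which mirrors the abstract's qualitative claim that the current alignment state and distribution narrowness jointly govern the pull-back. Whether the paper's closed-form update takes this exact form is not something the abstract confirms.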

Problem

Research questions and friction points this paper is trying to address.

alignment fragility

fine-tuning dynamics

distributional shift

posterior narrowness

re-alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

alignment dynamics

rebound force

driving force