🤖 AI Summary
This work demonstrates that fine-tuning aligned language models can unpredictably compromise safety guardrails, even when the training data is entirely benign, as a consequence of the optimization dynamics themselves. Through geometric analysis of the high-dimensional parameter space, we reveal that alignment fragility arises from low-dimensional, high-curvature subspaces, where second-order effects of gradient descent steer optimization trajectories into safety-sensitive regions. We establish, for the first time, a quartic scaling law linking alignment loss to training time, and we formulate precise conditions under which alignment instability occurs, thereby elucidating the intrinsic geometric mechanism behind alignment collapse. Our findings expose a structural blind spot in current safety fine-tuning paradigms and advocate a shift in safety evaluation from reactive red-teaming toward predictive diagnostics.
📝 Abstract
Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when the training data contains no harmful content and developers have no adversarial intent. The prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: we show this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We resolve this puzzle through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend against. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism as the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of the alignment geometry and the strength of the curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm: the dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods and, we hope, will enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.
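To make the claimed mechanism and the quartic law concrete, here is a minimal toy sketch, not taken from the paper; all symbols below ($\mu$, $c$, $\eta$) are assumed for illustration. Consider a two-parameter model $\theta = (x, y)$, where $y$ spans a sharp, safety-critical direction with alignment loss $\mathcal{L}_{\text{align}} = \tfrac{1}{2}\mu y^2$ (sharpness $\mu$), and a fine-tuning loss $\mathcal{L}_{\text{ft}} = -x + c\,x y$ whose cross-curvature term $c\,xy$ couples the task direction $x$ to $y$ (coupling $c$). At initialization the gradient update is purely along $x$, exactly orthogonal to the safety direction, yet gradient descent with step size $\eta$ drifts into $y$, and to leading order

$$
x_t \approx \eta t, \qquad y_t \approx -\tfrac{1}{2}\,\eta^2 c\, t^2, \qquad \mathcal{L}_{\text{align}}(t) \approx \tfrac{1}{8}\,\mu\, c^2\, \eta^4\, t^4 .
$$

A short simulation of plain gradient descent on $\mathcal{L}_{\text{ft}}$ alone confirms that the alignment loss, although never optimized, grows quartically:

```python
# Toy illustration of curvature-driven alignment drift (assumed symbols,
# not the paper's model). Fine-tuning loss: L_ft = -x + c*x*y.
# Alignment loss (never optimized): L_align = 0.5 * mu * y**2.
eta, c, mu = 1e-2, 0.1, 10.0   # step size, curvature coupling, sharpness
x, y = 0.0, 0.0                # start exactly on the aligned solution y = 0

for t in range(1, 301):
    gx = -1.0 + c * y          # dL_ft/dx; at t = 1 the update is purely in x
    gy = c * x                 # dL_ft/dy; grows as x moves, steering y away
    x, y = x - eta * gx, y - eta * gy
    if t % 100 == 0:
        align = 0.5 * mu * y**2
        pred = mu * c**2 * eta**4 * t**4 / 8   # schematic quartic law
        print(f"t={t:3d}  alignment loss = {align:.4e}  quartic prediction = {pred:.4e}")
```

Within this toy, sharper alignment geometry (larger $\mu$) and stronger curvature coupling (larger $c$) both amplify the degradation, mirroring the dependence stated in the abstract; the paper's exact constants and definitions may differ.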