🤖 AI Summary
This work addresses the pervasive trade-off between capability enhancement and safety degradation in large language model (LLM) fine-tuning. We establish the first theoretical framework showing that this fundamental limit is jointly governed by data similarity, context overlap, and the geometry of the alignment loss landscape. Moving beyond prior empirical studies, we propose an analytical paradigm that integrates distributional geometry with loss landscape theory. Combining theoretical modeling, generalization error analysis, and alignment optimization theory, we derive an explicit, closed-form bound characterizing the trade-off. Extensive fine-tuning experiments across diverse datasets and safety benchmarks validate the tightness and predictive accuracy of this bound. Our results yield the first verifiable, theoretically grounded design principle for safe and controllable LLM adaptation, providing both quantitative guidance for data curation and alignment objective design, and foundational insight into the intrinsic limits of post-training alignment.
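As a point of orientation only (this is not the paper's bound), the following standard inequality illustrates one way loss-landscape geometry can enter such an analysis: if the alignment (safety) loss is assumed to be β-smooth, then the safety loss of the fine-tuned parameters is controlled by the aligned model, the alignment-loss gradient at that model, and the parameter drift induced by fine-tuning. All symbols below (the safety loss L_safe, the aligned parameters θ_0, the fine-tuned parameters θ_ft, and the smoothness constant β) are illustrative placeholders, not notation from the paper.

```latex
% Illustrative only: a standard beta-smoothness inequality, not the paper's bound.
% Safety degradation after fine-tuning is controlled by (i) the alignment-loss
% gradient at the aligned model theta_0 and (ii) the parameter drift
% theta_ft - theta_0 induced by fine-tuning; in practice the size of that drift
% depends on how dissimilar the fine-tuning data is from the alignment data.
\[
\mathcal{L}_{\mathrm{safe}}(\theta_{\mathrm{ft}})
\;\le\;
\mathcal{L}_{\mathrm{safe}}(\theta_{0})
\;+\;
\big\langle \nabla \mathcal{L}_{\mathrm{safe}}(\theta_{0}),\, \theta_{\mathrm{ft}} - \theta_{0} \big\rangle
\;+\;
\frac{\beta}{2}\,\big\lVert \theta_{\mathrm{ft}} - \theta_{0} \big\rVert^{2}
\]
```

Under this reading, data similarity and context overlap influence the drift term, while the curvature of the alignment loss sets how sharply that drift translates into safety degradation.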
📝 Abstract
Fine-tuning Large Language Models (LLMs) on task-specific datasets has become one of their primary use cases. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability under two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and the alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning and are corroborated by numerical experiments.
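The kind of empirical validation mentioned above can be sketched as follows: score a series of fine-tuned checkpoints on a capability benchmark and a safety benchmark, then trace how the two move against each other. This is a minimal, assumption-laden sketch; the helper callables (`evaluate_capability`, `evaluate_safety`) and the checkpoint format are hypothetical placeholders, not the paper's experimental protocol or any specific library API.

```python
# Minimal sketch: chart capability gain vs. safety drop across fine-tuning runs.
# The evaluate_* callables and checkpoint objects are hypothetical placeholders.

from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class TradeoffPoint:
    steps: int          # number of fine-tuning steps for this checkpoint
    capability: float   # task-benchmark score (higher is more capable)
    safety: float       # safety-benchmark score (higher is safer)


def chart_tradeoff(
    checkpoints: Iterable[Tuple[int, Any]],
    evaluate_capability: Callable[[Any], float],
    evaluate_safety: Callable[[Any], float],
) -> List[TradeoffPoint]:
    """Score each fine-tuned checkpoint on both axes of the trade-off."""
    points = [
        TradeoffPoint(
            steps=steps,
            capability=evaluate_capability(model),
            safety=evaluate_safety(model),
        )
        for steps, model in checkpoints
    ]
    # Sorting by fine-tuning steps makes the capability-up / safety-down
    # trajectory easy to plot or tabulate against a theoretical bound.
    return sorted(points, key=lambda p: p.steps)
```

Plotting safety against capability for these points yields the empirical trade-off curve that a theoretical bound of the kind described above would be expected to upper-envelope.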