🤖 AI Summary
This work addresses the pervasive trade-off between capability enhancement and safety degradation in large language model (LLM) fine-tuning. We establish the first theoretical framework showing that this fundamental limit is jointly governed by data similarity, context overlap, and the geometry of the alignment loss landscape. Moving beyond prior empirical studies, we propose an analytical paradigm that integrates distributional geometry with loss landscape theory. Combining theoretical modeling, generalization error analysis, and alignment optimization theory, we derive an explicit, closed-form bound characterizing the trade-off. Extensive fine-tuning experiments across diverse datasets and safety benchmarks validate the tightness and predictive accuracy of this bound. Our results yield the first verifiable, theoretically grounded design principle for safe and controllable LLM adaptation, providing both quantitative guidance for data curation and alignment objective design, and foundational insight into the intrinsic limits of post-training alignment.
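As a point of orientation only (this is not the paper's bound), the following standard inequality illustrates one way loss-landscape geometry can enter such an analysis: if the alignment (safety) loss is assumed to be β-smooth, then the safety loss of the fine-tuned parameters is controlled by the aligned model, the alignment-loss gradient at that model, and the parameter drift induced by fine-tuning. All symbols below (the safety loss L_safe, the aligned parameters θ_0, the fine-tuned parameters θ_ft, and the smoothness constant β) are illustrative placeholders, not notation from the paper.

```latex
% Illustrative only: a standard beta-smoothness inequality, not the paper's bound.
% Safety degradation after fine-tuning is controlled by (i) the alignment-loss
% gradient at the aligned model theta_0 and (ii) the parameter drift
% theta_ft - theta_0 induced by fine-tuning; in practice the size of that drift
% depends on how dissimilar the fine-tuning data is from the alignment data.
\[
\mathcal{L}_{\mathrm{safe}}(\theta_{\mathrm{ft}})
\;\le\;
\mathcal{L}_{\mathrm{safe}}(\theta_{0})
\;+\;
\big\langle \nabla \mathcal{L}_{\mathrm{safe}}(\theta_{0}),\, \theta_{\mathrm{ft}} - \theta_{0} \big\rangle
\;+\;
\frac{\beta}{2}\,\big\lVert \theta_{\mathrm{ft}} - \theta_{0} \big\rVert^{2}
\]
```

Under this reading, data similarity and context overlap influence the drift term, while the curvature of the alignment loss sets how sharply that drift translates into safety degradation.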
📝 Abstract
Fine-tuning Large Language Models (LLMs) on task-specific datasets has become one of their primary use cases. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability under two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and the alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning and are corroborated by numerical experiments.
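The kind of empirical validation mentioned above can be sketched as follows: score a series of fine-tuned checkpoints on a capability benchmark and a safety benchmark, then trace how the two move against each other. This is a minimal, assumption-laden sketch; the helper callables (`evaluate_capability`, `evaluate_safety`) and the checkpoint format are hypothetical placeholders, not the paper's experimental protocol or any specific library API.

```python
# Minimal sketch: chart capability gain vs. safety drop across fine-tuning runs.
# The evaluate_* callables and checkpoint objects are hypothetical placeholders.

from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class TradeoffPoint:
    steps: int          # number of fine-tuning steps for this checkpoint
    capability: float   # task-benchmark score (higher is more capable)
    safety: float       # safety-benchmark score (higher is safer)


def chart_tradeoff(
    checkpoints: Iterable[Tuple[int, Any]],
    evaluate_capability: Callable[[Any], float],
    evaluate_safety: Callable[[Any], float],
) -> List[TradeoffPoint]:
    """Score each fine-tuned checkpoint on both axes of the trade-off."""
    points = [
        TradeoffPoint(
            steps=steps,
            capability=evaluate_capability(model),
            safety=evaluate_safety(model),
        )
        for steps, model in checkpoints
    ]
    # Sorting by fine-tuning steps makes the capability-up / safety-down
    # trajectory easy to plot or tabulate against a theoretical bound.
    return sorted(points, key=lambda p: p.steps)
```

Plotting safety against capability for these points yields the empirical trade-off curve that a theoretical bound of the kind described above would be expected to upper-envelope.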