Curvature-Aware Safety Restoration in LLM Fine-Tuning

📅 2025-11-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-tuning large language models (LLMs) often degrades safety alignment, even with parameter-efficient methods like LoRA. We observe that the loss landscape geometry for harmful responses remains largely preserved after fine-tuning, indicating that safety knowledge is not erased but shifted to low-impact regions of the parameter space. To address this, we propose curvature-aware safety recovery: influence functions identify critical parameter directions, and selective loss amplification via second-order optimization, operating within the shared loss manifold geometry, precisely suppresses harmful outputs. Our method requires minimal parameter intervention, preserves downstream task performance, and even improves few-shot generalization. Extensive experiments across multiple model families and adversarial settings demonstrate significant reductions in harmful response rates while maintaining or enhancing model utility.

📝 Abstract
Fine-tuning Large Language Models (LLMs) for downstream tasks often compromises safety alignment, even when using parameter-efficient methods like LoRA. In this work, we uncover a notable property: fine-tuned models preserve the geometric structure of their loss landscapes concerning harmful content, regardless of the fine-tuning method employed. This suggests that safety behaviors are not erased but shifted to less influential regions of the parameter space. Building on this insight, we propose a curvature-aware alignment restoration method that leverages influence functions and second-order optimization to selectively increase loss on harmful inputs while preserving task performance. By navigating the shared geometry between base and fine-tuned models, our method discourages unsafe outputs while preserving task-relevant performance, avoiding full reversion and enabling precise, low-impact updates. Extensive evaluations across multiple model families and adversarial settings show that our approach efficiently reduces harmful responses while maintaining or even improving utility and few-shot learning performance.
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning LLMs compromises safety alignment despite parameter-efficient methods
Safety behaviors shift but persist in loss landscape geometry after fine-tuning
How to restore safety via curvature-aware optimization without sacrificing task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curvature-aware alignment restoration using influence functions
Second-order optimization selectively increases harmful input loss
Navigates shared geometry to preserve task performance
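The innovation bullets above can be illustrated with a toy NumPy sketch of the general idea: score parameter coordinates by an influence proxy (gradient scaled by inverse curvature), then perform a curvature-preconditioned gradient *ascent* on the harmful-data loss restricted to the highest-scoring coordinates. All function names, the diagonal-Fisher curvature approximation, and the top-k masking are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import numpy as np

def diag_fisher(grads):
    # Diagonal Fisher approximation from per-example gradients
    # (a common cheap stand-in for second-order curvature; an assumption here).
    return np.mean(grads ** 2, axis=0) + 1e-8

def curvature_aware_safety_step(theta, harmful_grads, task_grads, lr=0.1, top_k=2):
    """One toy restoration step:
    1. score coordinates by a self-influence proxy g^2 / F
    2. ascend the harmful-data loss only along the top_k coordinates,
       preconditioned by the inverse diagonal Fisher."""
    F = diag_fisher(np.vstack([harmful_grads, task_grads]))
    g_harm = harmful_grads.mean(axis=0)
    influence = g_harm ** 2 / F           # per-coordinate influence score
    mask = np.zeros_like(theta)
    mask[np.argsort(influence)[-top_k:]] = 1.0
    # Masked, curvature-preconditioned gradient ascent on harmful loss:
    return theta + lr * mask * g_harm / F
```

Because the update is masked to a few influence-scored coordinates and scaled by inverse curvature, coordinates that mainly carry the task gradient are left untouched, which is the low-impact, selective intervention the bullets describe.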