Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-tuning large language models (LLMs) on benign data often induces catastrophic forgetting, degrading safety alignment. To address this, we propose a behavior-aware safety exemplar sampling framework: first, responses to safety-critical instructions are classified at a fine-grained level based on behavioral outcomes (compliance vs. refusal); second, semantic clustering across diverse harm categories ensures sample diversity. This enables precise reinforcement of critical safety knowledge with only a 0.5% increase in safety data volume. Our method significantly mitigates safety performance degradation during fine-tuning—reducing harmfulness by up to 41% across multiple harm categories—while preserving model helpfulness. The core contribution lies in integrating response-behavior modeling with semantic-diversity-driven sampling, empirically validating both the efficacy and necessity of small-scale, high-quality safety data in large-scale LLM fine-tuning.

📝 Abstract
Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.
Problem

Research questions and friction points this paper is trying to address.

Mitigating safety behavior loss during language model fine-tuning
Identifying the most effective safety examples to prevent harmful outputs
Developing behavior-aware sampling for safer model adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Behavior-aware sampling framework for safety examples
Selects examples by refusal behavior and harm diversity
Reduces harmful outputs while maintaining model helpfulness
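The selection pipeline described above — filter safety-critical examples by response behavior, then cluster them semantically to cover diverse harm categories — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the keyword-based `classify_behavior` heuristic, the lightweight k-means routine, and the per-cluster pick-first selection rule are all simplifying assumptions standing in for the paper's fine-grained behavioral classifier and embedding model.

```python
import numpy as np

def classify_behavior(response):
    # Crude keyword heuristic standing in for the paper's fine-grained
    # behavioral classifier (assumption, not the authors' method).
    refusal_markers = ("i can't", "i cannot", "i won't", "sorry")
    return "refusal" if response.lower().startswith(refusal_markers) else "compliance"

def kmeans(X, k, iters=20, seed=0):
    # Minimal k-means over embedding vectors, used here as a stand-in
    # for semantic clustering across harm categories.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def sample_safety_exemplars(examples, embeddings, k):
    # Step 1: keep only refusal-behavior responses to safety-critical prompts.
    keep = [i for i, ex in enumerate(examples)
            if classify_behavior(ex["response"]) == "refusal"]
    # Step 2: cluster their embeddings and take one exemplar per cluster,
    # so the small exemplar set spans diverse harm categories.
    labels = kmeans(embeddings[keep], k)
    chosen = []
    for j in range(k):
        members = [keep[i] for i in range(len(keep)) if labels[i] == j]
        if members:
            chosen.append(members[0])
    return [examples[i] for i in chosen]
```

In practice the exemplars returned by such a procedure would be mixed into the benign fine-tuning set at a small ratio (the paper reports a 0.5% increase in data volume) to reinforce refusal behavior during adaptation.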
Anh Pham
University of Massachusetts Amherst
Mihir Thalanki
University of Massachusetts Amherst
Michael Sun
University of Massachusetts Amherst
Aditya Chaloo
University of Massachusetts Amherst
Ankita Gupta
University of Massachusetts Amherst
Tian Xia
Microsoft
Aditya Mate
Microsoft New England
Ehimwenma Nosakhare
Microsoft
Soundararajan Srinivasan
Microsoft Corporation