🤖 AI Summary
Fine-tuning large language models (LLMs) on benign data often induces catastrophic forgetting that degrades safety alignment. To address this, we propose a behavior-aware safety exemplar sampling framework: first, responses to safety-critical instructions are classified at a fine-grained level by behavioral outcome (compliance versus refusal); second, semantic clustering across diverse harm categories ensures sample diversity. This enables precise reinforcement of critical safety knowledge with only 0.5% additional training data. Our method significantly mitigates safety degradation during fine-tuning, reducing harmfulness by up to 41% across multiple harm categories while preserving model helpfulness. The core contribution lies in integrating response-behavior modeling with semantic-diversity-driven sampling, empirically validating both the efficacy and the necessity of small-scale, high-quality safety data in large-scale LLM fine-tuning.
📝 Abstract
Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.
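To make the two selection factors concrete, here is a minimal sketch of how such a behavior-aware sampler might look: examples are first grouped by behavior label (refusal vs. compliance), then within each group a small, semantically diverse subset is chosen. The function names, the greedy farthest-point heuristic (used here as a stand-in for the paper's semantic clustering), and the toy embeddings are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine_dist(a, b):
    """Cosine distance between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def select_safety_exemplars(pool, budget):
    """Select `budget` safety exemplars from `pool`.

    Each pool entry is a dict with an 'embedding' (instruction embedding)
    and a 'behavior' label ('refusal' or 'compliance'). The budget is
    split evenly across behavior labels; within each label, examples are
    picked by greedy farthest-point selection so that the chosen set is
    semantically diverse (a proxy for covering distinct harm categories).
    """
    by_behavior = {}
    for ex in pool:
        by_behavior.setdefault(ex["behavior"], []).append(ex)

    per_label = max(1, budget // max(1, len(by_behavior)))
    selected = []
    for group in by_behavior.values():
        chosen = [group[0]]  # seed with an arbitrary first example
        while len(chosen) < min(per_label, len(group)):
            # Pick the candidate farthest from everything chosen so far.
            best = max(
                (ex for ex in group if ex not in chosen),
                key=lambda ex: min(
                    cosine_dist(ex["embedding"], c["embedding"]) for c in chosen
                ),
            )
            chosen.append(best)
        selected.extend(chosen)
    return selected[:budget]

# Toy pool: three refusal and two compliance examples with 2-d embeddings.
pool = [
    {"id": "r1", "behavior": "refusal", "embedding": [1.0, 0.0]},
    {"id": "r2", "behavior": "refusal", "embedding": [0.99, 0.1]},
    {"id": "r3", "behavior": "refusal", "embedding": [0.0, 1.0]},
    {"id": "c1", "behavior": "compliance", "embedding": [0.5, 0.5]},
    {"id": "c2", "behavior": "compliance", "embedding": [0.4, 0.6]},
]
exemplars = select_safety_exemplars(pool, budget=4)
```

With a budget of 4, the sampler takes two examples per behavior label and, within the refusal group, skips the near-duplicate `r2` in favor of the semantically distant `r3`. In practice the embeddings would come from a sentence encoder and the budget would be set to a small fraction (here, 0.5%) of the fine-tuning data.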