🤖 AI Summary
Fine-tuning large language models (LLMs) on benign data often induces catastrophic forgetting that degrades safety alignment. To address this, we propose a behavior-aware safety exemplar sampling framework: first, responses to safety-critical instructions are classified at a fine-grained level by behavioral outcome (compliance versus refusal); second, semantic clustering across diverse harm categories ensures sample diversity. This enables precise reinforcement of critical safety knowledge with only 0.5% additional training data. Our method significantly mitigates safety degradation during fine-tuning, reducing harmfulness by up to 41% across multiple harm categories while preserving model helpfulness. The core contribution lies in integrating response-behavior modeling with semantic-diversity-driven sampling, empirically validating both the efficacy and the necessity of small-scale, high-quality safety data in large-scale LLM fine-tuning.
📝 Abstract
Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.
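To make the two selection factors concrete, here is a minimal sketch of how such a behavior-aware sampler might look: examples are first grouped by behavior label (refusal vs. compliance), then within each group a small, semantically diverse subset is chosen. The function names, the greedy farthest-point heuristic (used here as a stand-in for the paper's semantic clustering), and the toy embeddings are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine_dist(a, b):
    """Cosine distance between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def select_safety_exemplars(pool, budget):
    """Select `budget` safety exemplars from `pool`.

    Each pool entry is a dict with an 'embedding' (instruction embedding)
    and a 'behavior' label ('refusal' or 'compliance'). The budget is
    split evenly across behavior labels; within each label, examples are
    picked by greedy farthest-point selection so that the chosen set is
    semantically diverse (a proxy for covering distinct harm categories).
    """
    by_behavior = {}
    for ex in pool:
        by_behavior.setdefault(ex["behavior"], []).append(ex)

    per_label = max(1, budget // max(1, len(by_behavior)))
    selected = []
    for group in by_behavior.values():
        chosen = [group[0]]  # seed with an arbitrary first example
        while len(chosen) < min(per_label, len(group)):
            # Pick the candidate farthest from everything chosen so far.
            best = max(
                (ex for ex in group if ex not in chosen),
                key=lambda ex: min(
                    cosine_dist(ex["embedding"], c["embedding"]) for c in chosen
                ),
            )
            chosen.append(best)
        selected.extend(chosen)
    return selected[:budget]

# Toy pool: three refusal and two compliance examples with 2-d embeddings.
pool = [
    {"id": "r1", "behavior": "refusal", "embedding": [1.0, 0.0]},
    {"id": "r2", "behavior": "refusal", "embedding": [0.99, 0.1]},
    {"id": "r3", "behavior": "refusal", "embedding": [0.0, 1.0]},
    {"id": "c1", "behavior": "compliance", "embedding": [0.5, 0.5]},
    {"id": "c2", "behavior": "compliance", "embedding": [0.4, 0.6]},
]
exemplars = select_safety_exemplars(pool, budget=4)
```

With a budget of 4, the sampler takes two examples per behavior label and, within the refusal group, skips the near-duplicate `r2` in favor of the semantically distant `r3`. In practice the embeddings would come from a sentence encoder and the budget would be set to a small fraction (here, 0.5%) of the fine-tuning data.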