🤖 AI Summary
This study reveals that instruction fine-tuning, even on small-scale benign datasets, degrades the safety alignment of large language models (LLMs), and identifies three key mechanisms behind this degradation: anomalous answer structure, identity calibration bias, and uncontrolled role-playing. Method: Using LLaMA and ChatGLM models, we conduct a systematic analysis with human-annotated safety evaluation benchmarks, multidimensional adversarial prompts, and reward model (RM) consistency assessment. Contribution/Results: We empirically demonstrate, for the first time, how benign fine-tuning undermines safety alignment; we further show that mainstream RMs are severely misaligned with human safety preferences, exceeding a 40% error rate in safety preference modeling. To support reproducibility and further research, we open-source our safety evaluation dataset and fine-tuning code, providing both theoretical insight and practical tools for improving safety robustness in LLMs.
📝 Abstract
Large language models (LLMs) have emerged as powerful tools for addressing a wide range of general inquiries and tasks. However, fine-tuning aligned LLMs on smaller, domain-specific datasets, a step critical to adapting them to specialized tasks, can inadvertently degrade their safety alignment, even when the datasets are benign. This degradation makes models more susceptible to providing inappropriate responses. In this study, we systematically examine the factors contributing to safety alignment degradation in benign fine-tuning scenarios. Our analysis identifies three critical factors affecting aligned LLMs: answer structure, identity calibration, and role-play. Additionally, we evaluate the reliability of state-of-the-art reward models (RMs), which are often used to guide alignment processes. Our findings reveal that these RMs frequently fail to reflect human preferences regarding safety, underscoring their limitations in practical applications. By uncovering these challenges, our work highlights the complexities of maintaining safety alignment during fine-tuning and offers guidance to help developers balance utility and safety in LLMs. The datasets and fine-tuning code used in our experiments are available at https://github.com/GuanlinLee/llm_instruction_tuning.
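The summary's ">40% error rate" figure refers to how often an RM's pairwise ranking contradicts human safety preferences. As a purely illustrative sketch of that metric (the `score` function and the toy length-based RM below are hypothetical stand-ins, not the paper's actual models or data):

```python
# Hypothetical sketch: estimating an RM's safety preference error rate,
# i.e. the fraction of human-annotated pairs where the RM ranks the
# unsafe answer at least as high as the safe one.

def preference_error_rate(pairs, score):
    """pairs: list of (safe_answer, unsafe_answer) tuples, where humans
    prefer the first element on safety grounds.
    score: callable mapping a response string to a scalar RM reward.
    Returns the fraction of pairs the RM gets wrong."""
    errors = sum(1 for safe, unsafe in pairs if score(safe) <= score(unsafe))
    return errors / len(pairs)

# Toy RM that rewards response length regardless of safety
# (an assumption chosen only to make the failure mode visible):
toy_score = len

pairs = [
    ("I can't help with that.",
     "Sure, here is a detailed guide to picking locks: ..."),
    ("Please see a doctor.",
     "Just take double the dose, it is fine."),
]

print(preference_error_rate(pairs, toy_score))  # → 1.0
```

A real evaluation would replace `toy_score` with an actual reward model's scalar output and `pairs` with the human-annotated benchmark; the metric itself stays the same.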