Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study reveals that instruction fine-tuning—even on small-scale benign datasets—degrades the safety alignment of large language models (LLMs), identifying three key mechanisms: anomalous answer structure, identity calibration bias, and uncontrolled role-playing. Method: Using LLaMA and ChatGLM architectures, the authors conduct a systematic analysis with human-annotated safety evaluation benchmarks, multidimensional adversarial prompts, and reward model (RM) consistency assessment. Contribution/Results: The paper empirically demonstrates how benign fine-tuning undermines safety alignment and shows that mainstream RMs are severely misaligned with human safety preferences, with error rates exceeding 40% in safety preference modeling. To support reproducibility, the authors open-source their safety evaluation dataset and fine-tuning code, providing both empirical insights and practical tools for improving safety robustness in LLMs.

📝 Abstract
Large language models (LLMs) have emerged as powerful tools for addressing a wide range of general inquiries and tasks. Despite this, fine-tuning aligned LLMs on smaller, domain-specific datasets—critical for adapting them to specialized tasks—can inadvertently degrade their safety alignment, even when the datasets are benign. This phenomenon makes models more susceptible to providing inappropriate responses. In this study, we systematically examine the factors contributing to safety alignment degradation in benign fine-tuning scenarios. Our analysis identifies three critical factors affecting aligned LLMs: answer structure, identity calibration, and role-play. Additionally, we evaluate the reliability of state-of-the-art reward models (RMs), which are often used to guide alignment processes. Our findings reveal that these RMs frequently fail to accurately reflect human preferences regarding safety, underscoring their limitations in practical applications. By uncovering these challenges, our work highlights the complexities of maintaining safety alignment during fine-tuning and offers guidance to help developers balance utility and safety in LLMs. Datasets and fine-tuning code used in our experiments are available at https://github.com/GuanlinLee/llm_instruction_tuning.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Safety
Response Accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Safety Alignment
Adapter Tuning
Guanlin Li
Nanyang Technological University
Kangjie Chen
Nanyang Technological University
Trustworthy AI · Red-teaming · Backdoor Attacks · LLM-based Agents
Shangwei Guo
Chongqing University
AI System Security · Data Privacy
Jie Zhang
CFAR, A*STAR
Han Qiu
Nanyang Technological University
Chao Zhang
Tsinghua University
Guoyin Wang
01.AI
Tianwei Zhang
Nanyang Technological University
Jiwei Li
Zhejiang University