Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

📅 2024-02-28
🏛️ arXiv.org
📈 Citations: 28
Influential: 1
📄 PDF
🤖 AI Summary
This work addresses the degradation of safety alignment in large language models (LLMs) induced by task-oriented fine-tuning. We propose the "Pure Tuning, Safe Testing" (PTST) strategy: safety prompt templates are removed during fine-tuning to maximize downstream task performance and reintroduced at inference, where the intentional distribution shift helps suppress unsafe outputs. Systematic evaluations on Llama 2-Chat, Mistral 7B Instruct, and GPT-3.5 Turbo with GSM8K, ChatDoctor, and OpenOrca show that PTST maintains or even improves downstream accuracy while substantially reducing unsafe generations across diverse models and tasks. Crucially, PTST requires no modification to the training objective and no additional parameters, offering a low-cost, plug-and-play way to preserve alignment during fine-tuning and establishing a practical paradigm for alignment-aware adaptation.

📝 Abstract
Public LLMs such as Llama 2-Chat underwent alignment training and were considered safe. Recently, Qi et al. [2024] reported that even benign fine-tuning on seemingly safe datasets can give rise to unsafe behaviors in the models. The current paper is about methods and best practices to mitigate such loss of alignment. We focus on the setting where a public model is fine-tuned before serving users for a specific usage, where the model should improve on the downstream task while maintaining alignment. Through extensive experiments on several chat models (Meta's Llama 2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo), this paper uncovers that the prompt templates used during fine-tuning and inference play a crucial role in preserving safety alignment, and proposes the "Pure Tuning, Safe Testing" (PTST) strategy: fine-tune models without a safety prompt, but include it at test time. This seemingly counterintuitive strategy incorporates an intended distribution shift to encourage alignment preservation. Fine-tuning experiments on GSM8K, ChatDoctor, and OpenOrca show that PTST significantly reduces the rise of unsafe behaviors.
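The PTST recipe described in the abstract can be sketched in a few lines. The snippet below is an illustration, not code from the paper: the helper `build_prompt` and the `SAFETY_PROMPT` text are hypothetical, and the template loosely follows the Llama 2-Chat `[INST]`/`<<SYS>>` format.

```python
# Hedged sketch of PTST ("Pure Tuning, Safe Testing"): the safety system
# prompt is OMITTED from fine-tuning examples but INCLUDED at inference.
# SAFETY_PROMPT and build_prompt are illustrative names, not from the paper.

SAFETY_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe."
)

def build_prompt(user_msg: str, *, include_safety_prompt: bool) -> str:
    """Wrap a user message in a Llama-2-style chat template, optionally
    prepending a safety system prompt inside <<SYS>> tags."""
    if include_safety_prompt:
        return f"[INST] <<SYS>>\n{SAFETY_PROMPT}\n<</SYS>>\n\n{user_msg} [/INST]"
    return f"[INST] {user_msg} [/INST]"

# "Pure Tuning": fine-tuning examples carry no safety prompt.
train_prompt = build_prompt("Solve: 2 + 2 = ?", include_safety_prompt=False)

# "Safe Testing": user traffic at inference gets the safety prompt back.
serve_prompt = build_prompt("Solve: 2 + 2 = ?", include_safety_prompt=True)
```

The intentional mismatch between the two templates is the point: inference prompts sit slightly off the fine-tuning distribution, which the paper reports helps the model retain its original alignment behavior.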
Problem

Research questions and friction points this paper is trying to address.

Large Language Model
Performance Optimization
Safe Behavior Assurance
Innovation

Methods, ideas, or system contributions that make the work stand out.

PTST (Pure Tuning, Safe Testing)
Safety-preserving adaptation
Safety prompting at test time
Kaifeng Lyu
Tsinghua University
Haoyu Zhao
Computer Science Department & Princeton Language and Intelligence, Princeton University
Xinran Gu
Tsinghua University
Distributed Optimization · Deep Learning Theory
Dingli Yu
OpenAI
Anirudh Goyal
Mila, Université de Montréal
Machine Learning · Deep Learning · Deep Reinforcement Learning
Sanjeev Arora
Computer Science Department & Princeton Language and Intelligence, Princeton University