LookAhead Tuning: Safer Language Models via Partial Answer Previews

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Fine-tuning large language models (LLMs) often degrades their safety alignment. This paper addresses this problem by proposing Partial Answer Preview (PAP), a lightweight intervention applied during supervised fine-tuning that modulates the initial token distribution without requiring auxiliary models or reinforcement learning, thereby preserving native safety mechanisms. PAP integrates prefix masking and safety-aware sampling to jointly constrain the original token distribution and prioritize retention of safety-critical tokens. To our knowledge, this is the first safety-preserving fine-tuning paradigm leveraging preview-style intervention. Evaluated on benchmarks including AdvBench and SafeBench, PAP achieves an average safety improvement of 12.7% while incurring less than 0.5% degradation in downstream task performance—outperforming established baselines such as RLHF and DPO.

Technology Category

Application Category

📝 Abstract
Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often undermines their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, which comprises two simple, low-resource, and effective data-driven methods that modify training data by previewing partial answer prefixes. Both methods aim to preserve the model's inherent safety mechanisms by minimizing perturbations to initial token distributions. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs. Code is released at https://github.com/zjunlp/LookAheadTuning.
Problem

Research questions and friction points this paper is trying to address.

Prevents safety degradation in fine-tuned language models
Modifies training data with partial answer previews
Maintains model safety without compromising task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preview partial answer prefixes during training
Minimize perturbations to initial token distributions
Maintain model safety without sacrificing performance
🔎 Similar Papers
No similar papers found.
Kangwei Liu
Kangwei Liu
Institute of Information Engineering, Chinese Academy of Sciences
Audio-driven Talking Face GenerationFacial Animation
M
Mengru Wang
Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Y
Yujie Luo
Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
L
Lin Yuan
Ant Group - Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Mengshu Sun
Mengshu Sun
Beijing University of Technology
Deep LearningModel Compression and Acceleration
Ningyu Zhang
Ningyu Zhang
Ph.D. Student, Vanderbilt University
artificial intelligencelearning analyticslearning environments
Lei Liang
Lei Liang
Ant Group
Knowledge GraphAI
Z
Zhiqiang Zhang
Ant Group - Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
J
Jun Zhou
Ant Group
H
Huajun Chen
Zhejiang University