Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models

📅 2024-06-12

📈 Citations: 6

✨ Influential: 0

🤖 AI Summary

This work identifies a previously overlooked risk of safety-alignment degradation in task-specific fine-tuning (e.g., fine-tuning on multiple-choice data): adversarial actors can manipulate dataset structure to induce harmful model outputs while preserving downstream task performance. To address this, we propose *Format- and Style-Consistent Safe Data Mixing*, a method that synthesizes safety-aligned data, mimics the target task's formatting, and matches its instruction style so that safe samples blend into the fine-tuning data. Our approach is the first to systematically demonstrate that task-level data structure can serve as an implicit attack surface, and it enables joint optimization of safety and task utility. Experiments across multiple benchmarks show over 50% reduction in harmful response rates while retaining ≥98% of original task accuracy, substantially outperforming existing defense baselines.

📝 Abstract

Recent research shows that fine-tuning on benign instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. While instruction-following fine-tuning is important, task-specific fine-tuning - where models are trained on datasets with clear ground truth answers (e.g., multiple choice questions) - can enhance model performance on specialized downstream tasks. Understanding and mitigating safety risks in the task-specific setting remains distinct from the instruction-following context due to structural differences in the data. Our work demonstrates how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data, showing this is significantly more effective and efficient than existing baselines at re-establishing safety alignment while maintaining similar task performance.

Problem

Research questions and friction points this paper is trying to address.

Mitigating safety risks in task-specific fine-tuning of large language models.

Addressing malicious manipulation of task-specific datasets to prevent dangerous behaviors.

Proposing a novel strategy to re-establish safety alignment without compromising task performance.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixes safety data with task-specific datasets

Mimics task format and prompting style

Re-establishes safety alignment while maintaining similar task performance
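
The mixing idea in the points above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt template in `render`, the function name `mix_with_safety_data`, and the `safety_fraction` parameter are all hypothetical, and the sketch assumes that "mimicking the task format" can be approximated by passing safety examples through the same template and instruction text as the task data.

```python
import random


def render(instruction, query, answer):
    """Hypothetical prompt template shared by task and safety examples."""
    return {
        "prompt": f"{instruction}\n\nInput: {query}\nOutput:",
        "completion": f" {answer}",
    }


def mix_with_safety_data(task_rows, safety_rows, instruction,
                         safety_fraction=0.1, seed=0):
    """Mix format-matched safety examples into a task-specific dataset.

    task_rows:   list of (query, ground_truth_answer) pairs
    safety_rows: list of (harmful_query, safe_response) pairs
    safety_fraction: safety examples added per task example (assumed knob)
    """
    rng = random.Random(seed)

    # Task examples are rendered with the task's own template.
    task = [render(instruction, q, a) for q, a in task_rows]

    # Sample safety pairs and render them with the SAME template and
    # instruction, so they share the task's format and prompting style.
    n_safety = min(len(safety_rows), round(safety_fraction * len(task)))
    sampled = rng.sample(safety_rows, n_safety)
    safety = [render(instruction, q, r) for q, r in sampled]

    mixed = task + safety
    rng.shuffle(mixed)
    return mixed
```

The fixed seed keeps the mix reproducible. How many safety examples to add, and how closely their style must match the task data, are the quantities the paper studies; the default of 0.1 here is only a placeholder.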
