Efficient Safety Retrofitting Against Jailbreaking for LLMs

📅 2025-02-19

📈 Citations: 0

✨ Influential: 0

career value

241K/year

🤖 AI Summary

This work addresses the security vulnerability of large language models (LLMs) to jailbreaking attacks. We propose a lightweight, efficient safety alignment method based on Direct Preference Optimization (DPO) using only 2K high-quality samples, trained on the multi-source Egida dataset—covering 27 safety categories and 18 adversarial attack styles—and enhanced by human-model collaborative annotation (Egida-HSafe). To our knowledge, this is the first systematic study demonstrating DPO’s capability to achieve cross-topic and cross-attack-style generalization in safety enhancement under low-cost conditions (USD 3–20) and minimal data requirements. Experiments show that the strongest jailbreak success rate drops to ~5%, with overall attack success rates reduced by 10–30%; meanwhile, general capabilities remain stable and over-rejection rates decrease significantly. All data and models are publicly released.

Technology Category

Application Category

📝 Abstract

Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data, bypassing the need for explicit reward models. Its simplicity enables easy adaptation to various domains and safety requirements. This paper examines DPO's effectiveness in model safety against jailbreaking attacks while minimizing data requirements and training costs. We introduce Egida, a dataset expanded from multiple sources, which includes 27 different safety topics and 18 different attack styles, complemented with synthetic and human labels. This data is used to boost the safety of state-of-the-art LLMs (Llama-3.1-8B/70B-Instruct, Qwen-2.5-7B/72B-Instruct) across topics and attack styles. In addition to safety evaluations, we assess their post-alignment performance degradation in general purpose tasks, and their tendency to over refusal. Following the proposed methodology, trained models reduce their Attack Success Rate by 10%-30%, using small training efforts (2,000 samples) with low computational cost (3$ for 8B models, 20$ for 72B models). Safety aligned models generalize to unseen topics and attack styles, with the most successful attack style reaching a success rate around 5%. Size and family are found to strongly influence model malleability towards safety, pointing at the importance of pre-training choices. To validate our findings, a large independent assessment of human preference agreement with Llama-Guard-3-8B is conducted by the authors and the associated dataset Egida-HSafe is released. Overall, this study illustrates how affordable and accessible it is to enhance LLM safety using DPO while outlining its current limitations. All datasets and models are released to enable reproducibility and further research.

Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM safety against jailbreaking attacks efficiently.

Minimizing data and training costs for safety alignment.

Assessing post-alignment performance and over refusal tendencies.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct Preference Optimization (DPO)

Egida dataset expansion

Low-cost safety alignment

🔎 Similar Papers

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance