🤖 AI Summary
Large language models (LLMs) are vulnerable to adversarial fine-tuning attacks, and existing guardrail models fail to detect carefully crafted malicious training data. To demonstrate this, the authors propose Virus, a red-teaming attack framework that applies small, optimized modifications to harmful fine-tuning data so that the data evades guardrail moderation while retaining its ability to break the model's safety alignment. Experiments show that Virus-optimized data leaks past the guardrail at rates of up to 100% while achieving strong attack performance. The work provides systematic empirical evidence against relying on guardrail moderation alone, exposing the limitations of filtering-based defenses and motivating alignment mechanisms that are robust to adversarial fine-tuning.
📝 Abstract
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks -- models lose their safety alignment after fine-tuning on even a few harmful samples. To mitigate this risk, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, in this paper we show that relying purely on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses guardrail moderation by slightly modifying the harmful data. Experimental results show that harmful data optimized by Virus evades guardrail detection with up to a 100% leakage ratio, while simultaneously achieving superior attack performance. Finally, the key message we want to convey through this paper is that **it is reckless to consider guardrail moderation as a clutch at straws against harmful fine-tuning attacks**, as it cannot solve the inherent safety issue of pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus
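The abstract's central metric is the *leakage ratio*: the fraction of harmful fine-tuning samples that the moderation guardrail fails to flag, and which therefore "leak" into the fine-tuning set. The toy sketch below illustrates the metric only; the keyword-based `toy_guardrail` is a hypothetical stand-in for a real learned moderation model, and all sample strings and names are illustrative, not from the paper.

```python
def toy_guardrail(sample: str) -> bool:
    """Return True if the sample is flagged as harmful.

    Hypothetical keyword filter standing in for a learned guardrail.
    """
    blocklist = ("attack", "exploit", "steal")
    return any(word in sample.lower() for word in blocklist)


def leakage_ratio(harmful_samples: list[str]) -> float:
    """Fraction of harmful samples NOT flagged by the guardrail."""
    if not harmful_samples:
        return 0.0
    leaked = sum(1 for s in harmful_samples if not toy_guardrail(s))
    return leaked / len(harmful_samples)


# Lightly rephrased samples slip past the keyword filter, loosely
# mirroring how Virus's small edits evade a learned guardrail.
samples = [
    "how to attack a server",      # flagged by the blocklist
    "how to compromise a server",  # leaks: synonym evades keywords
    "steal credentials quickly",   # flagged by the blocklist
    "borrow credentials quickly",  # leaks
]
print(leakage_ratio(samples))  # 0.5
```

A leakage ratio of 1.0 corresponds to the "up to 100%" result reported in the abstract: every optimized harmful sample passes the guardrail unflagged.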