Tamper-Resistant Safeguards for Open-Weight LLMs

📅 2024-08-01
🏛️ arXiv.org
📈 Citations: 20
Influential: 5
🤖 AI Summary
Open-weight large language models (LLMs) are vulnerable to weight-tampering attacks, and existing refusal and unlearning safeguards degrade rapidly—often within just a few fine-tuning steps. Method: The paper proposes TAR, a method that embeds tamper-resistant safeguards directly into the model weights, departing from fragile paradigms such as soft prompts or lightweight fine-tuning. TAR jointly optimizes gradient masking and adversarial regularization against simulated tampering attacks, and is evaluated with a red-teaming–driven robustness framework. Contribution/Results: Protected models are reported to retain over 95% of their refusal capability on harmful requests even after hundreds of adversarial fine-tuning steps, while preserving more than 98% of original performance on benign tasks. This is presented as the first endogenous, weight-level tamper-resistance defense for open-weight LLMs, enabling safer open-sourcing and deployment.
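The core idea—optimize weights so that a simulated fine-tuning attack fails to restore harmful behavior, while benign performance is preserved—can be illustrated with a deliberately tiny toy model. The sketch below is NOT the paper's implementation: it uses a single scalar parameter, quadratic stand-ins for the benign and harmful losses, and finite-difference gradients instead of autodiff, and every function name here is hypothetical. The outer loop differentiates through an unrolled inner attack, pushing the parameter toward low benign loss and high *post-attack* harmful loss.

```python
# Toy sketch of tamper-resistance training (all names hypothetical).
# One scalar parameter stands in for the model weights:
# the benign task is solved near theta = 0; the harmful
# capability is fully restored near theta = 1.

def benign_loss(theta):
    return (theta - 0.0) ** 2


def harmful_loss(theta):
    # Low value = harmful capability recovered by the adversary.
    return (theta - 1.0) ** 2


def grad(f, x, eps=1e-5):
    # Central finite difference so the sketch needs no autodiff library.
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def tamper_attack(theta, steps=5, lr=0.05):
    # Simulated adversary: a few fine-tuning steps minimizing harmful loss.
    for _ in range(steps):
        theta = theta - lr * grad(harmful_loss, theta)
    return theta


def tar_objective(theta, lam=0.5):
    # Keep benign loss low while keeping the POST-ATTACK harmful loss high.
    return benign_loss(theta) - lam * harmful_loss(tamper_attack(theta))


def train_tar(theta=0.0, outer_steps=200, lr=0.1):
    # Outer loop: gradient descent through the unrolled attack
    # (here via finite differences over the whole pipeline).
    for _ in range(outer_steps):
        theta = theta - lr * grad(tar_objective, theta)
    return theta
```

In this toy setting the defense pushes the parameter *past* the naive safe point (theta = 0, away from theta = 1), so the adversary's fixed fine-tuning budget ends further from the harmful optimum than it would against a naively-safe model—mirroring, in miniature, the claim that safeguards survive many attack steps at a small benign-performance cost.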

📝 Abstract
Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks that modify model weights. For example, recent works have demonstrated that refusal and unlearning safeguards can be trivially removed with a few steps of fine-tuning. These vulnerabilities necessitate new approaches for enabling the safe release of open-weight LLMs. We develop a method, called TAR, for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after hundreds of steps of fine-tuning. In extensive evaluations and red teaming analyses, we find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that progress on tamper-resistance is possible, opening up a promising new avenue to improve the safety and security of open-weight LLMs.
Problem

Research questions and friction points this paper is trying to address.

Enhancing tamper-resistance in open-weight LLMs
Preventing safeguard removal via fine-tuning
Improving safety of open-weight LLM releases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tamper-resistant safeguards for LLMs
TAR method enhances fine-tuning resistance
Preserves benign capabilities while improving security