Self-Destructive Language Model

📅 2025-05-18
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
This work addresses the inherent “trainability” vulnerability of large language models (LLMs) under adversarial fine-tuning attacks. It proposes a *self-destructive alignment* paradigm built on a loss function that couples the optimization trajectories of benign and harmful data. Combined with adversarial gradient ascent and an efficient Hessian-free gradient estimator with theoretical error bounds, the resulting model responds nonlinearly to malicious fine-tuning: stronger attacks induce more catastrophic performance collapse on the attacker's target tasks, while capability on legitimate tasks is preserved. Evaluated across multiple LLMs and benchmark datasets, the method achieves state-of-the-art robustness, outperforming prior defenses under low-intensity attacks and driving task performance to unusable levels under high-intensity attacks, thereby establishing a “no-win scenario” for adversaries. To the authors' knowledge, this is the first approach to build *intrinsic anti-alignment resistance*, i.e., built-in immunity to alignment subversion via fine-tuning, directly into the model.
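
The coupling idea can be illustrated with a short sketch. This is a hypothetical reconstruction, not the paper's actual SEAM objective: it assumes a Hugging Face-style causal LM whose forward pass returns a `.loss`, and the trade-off weight `lam` is an invented hyperparameter.

```python
def coupled_seam_loss(model, benign_batch, harmful_batch, lam=1.0):
    """Couple the benign and harmful optimization trajectories (sketch).

    Minimizing this objective performs gradient descent on benign data
    while performing gradient ascent on harmful data (the minus sign),
    so a later fine-tuning run that pushes the model toward the harmful
    objective also drags it away from the benign optimum.
    """
    benign_loss = model(**benign_batch).loss    # cross-entropy on benign text
    harmful_loss = model(**harmful_batch).loss  # cross-entropy on harmful text
    return benign_loss - lam * harmful_loss     # lam: assumed trade-off weight
```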

📝 Abstract
Harmful fine-tuning attacks pose a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models' inherent "trainability" on harmful data, leaving them vulnerable to stronger attacks with increased learning rates or larger harmful datasets. To overcome this critical limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data, enhanced with adversarial gradient ascent to amplify the self-destructive effect. To enable practical training, we develop an efficient Hessian-free gradient estimate with theoretical error bounds. Extensive evaluation across LLMs and datasets demonstrates that SEAM creates a no-win situation for adversaries: the self-destructive models achieve state-of-the-art robustness against low-intensity attacks and undergo catastrophic performance collapse under high-intensity attacks, rendering them effectively unusable. (warning: this paper contains potentially harmful content generated by LLMs.)
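
The abstract's "Hessian-free gradient estimate" is not spelled out on this page, but the standard trick such estimators build on, approximating a Hessian-vector product with two gradient evaluations instead of materializing the Hessian, can be sketched as follows. The function name and `eps` value are assumptions, not the paper's implementation.

```python
import torch

def hvp_finite_difference(loss_fn, params, vec, eps=1e-3):
    """Approximate H @ vec as (grad(theta + eps*vec) - grad(theta)) / eps.

    Only two gradient evaluations are needed, and the O(eps) truncation
    error of this first-order finite difference is the kind of quantity
    a theoretical error bound would control.
    """
    grads_0 = torch.autograd.grad(loss_fn(params), params)
    # Perturb the parameters along vec; detach so the perturbed copies
    # are fresh leaves of a new autograd graph.
    perturbed = [(p.detach() + eps * v).requires_grad_(True)
                 for p, v in zip(params, vec)]
    grads_1 = torch.autograd.grad(loss_fn(perturbed), perturbed)
    return [(g1 - g0) / eps for g0, g1 in zip(grads_0, grads_1)]
```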
Problem

Research questions and friction points this paper is trying to address.

Harmful fine-tuning attacks can dismantle LLM safety guardrails with minimal harmful data
Existing defenses leave models inherently "trainable" on harmful data, so stronger attacks (higher learning rates, larger harmful datasets) still succeed
How can misalignment attempts be made intrinsically self-defeating rather than merely harder?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-destructive models resist harmful fine-tuning attacks
Novel loss function couples benign and harmful data optimization
Efficient Hessian-free gradient estimate with theoretical error bounds enables practical training (a hypothetical training step is sketched below)
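
Putting the pieces together, a single hardening step might look like the sketch below. The optimizer handling, gradient clipping, and hyperparameters are all assumptions, and `coupled_seam_loss` is the illustrative function defined earlier, not the paper's released code.

```python
import torch

def seam_training_step(model, optimizer, benign_batch, harmful_batch, lam=1.0):
    """One hypothetical alignment-hardening step: descend on benign data,
    ascend on harmful data, via the coupled loss sketched above."""
    optimizer.zero_grad()
    loss = coupled_seam_loss(model, benign_batch, harmful_batch, lam=lam)
    loss.backward()
    # Clipping keeps the ascent term from destabilizing training (assumed).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```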
Yuhui Wang
Stony Brook University
Rongyi Zhu
University of Rochester
Ting Wang
Stony Brook University