CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from fragile safety alignment and remain vulnerable to jailbreak attacks, yet jailbreak research and defense development have long evolved in isolation. Method: This paper proposes the first unified generative adversarial framework for joint jailbreak attack and defense. It leverages the linear separability of intermediate-layer embeddings to model an internal safety boundary in the representation space, integrating a generative adversarial network (GAN) architecture, embedding-space boundary learning, and adversarial co-training. Results: Evaluated on three mainstream LLMs, the framework achieves an average jailbreak success rate of 88.85% and an average defense success rate of 84.17% against state-of-the-art jailbreak datasets. Its core contribution is the first unification of jailbreak attack and defense under a single generative adversarial paradigm, revealing and reinforcing LLMs' implicit safety structure and thereby enabling interpretable, verifiable alignment enhancement.
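The linear-separability claim underlying the method can be illustrated with a toy probe. The sketch below is a minimal illustration on synthetic data (the embedding dimension, cluster parameters, and logistic-regression probe are all assumptions for illustration, not the paper's setup): if "safe" and "harmful" prompt embeddings form linearly separable clusters at some intermediate layer, a single hyperplane w·x + b = 0 recovers the safety judgment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for intermediate-layer embeddings: two Gaussian clusters,
# one for "safe" prompts and one for "harmful" prompts.
d = 16
safe = rng.normal(loc=+1.0, scale=0.5, size=(200, d))
harmful = rng.normal(loc=-1.0, scale=0.5, size=(200, d))
X = np.vstack([safe, harmful])
y = np.concatenate([np.ones(200), np.zeros(200)])  # 1 = safe, 0 = harmful

# Fit a logistic-regression probe by gradient descent. If the embeddings
# are linearly separable, one hyperplane w·x + b = 0 classifies them.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(safe)
    w -= 0.5 * (X.T @ (p - y)) / len(y)       # gradient step on w
    b -= 0.5 * np.mean(p - y)                 # gradient step on b

acc = np.mean(((X @ w + b) > 0) == (y == 1))
print(f"probe accuracy: {acc:.2f}")
```

On well-separated synthetic clusters like these the probe reaches near-perfect accuracy, which is the behavior the paper reports for real intermediate-layer embeddings of safe versus harmful prompts.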

📝 Abstract
Safety alignment gives large language models (LLMs) protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism. Previous studies have treated LLM jailbreak attacks and defenses in isolation. We analyze the security protection mechanism of the LLM and propose a framework that combines attack and defense. Our method builds on the linearly separable property of LLM intermediate-layer embeddings and on the essence of jailbreak attacks: embedding harmful queries and shifting them into the safe region of the representation space. We use a generative adversarial network (GAN) to learn the security judgment boundary inside the LLM, enabling efficient jailbreak attack and defense. The experimental results indicate that our method achieves an average jailbreak success rate of 88.85% across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset reaches an average of 84.17%. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security. The code and data are available at https://github.com/NLPGM/CAVGAN.
Problem

Research questions and friction points this paper is trying to address.

Analyzing LLM security mechanisms and vulnerabilities
Combining jailbreak attacks and defenses in one framework
Using GANs to learn LLM security boundaries for attacks and defenses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses GAN to learn LLM security boundaries
Combines attack and defense in one framework
Leverages linearly separable embedding properties
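The attack side of the framework can be sketched in miniature. Assuming a linear safety probe as the internal "discriminator" (a simplification of the learned boundary; the fixed-direction perturbation below stands in for the paper's trained generator and is purely illustrative), a jailbreak amounts to perturbing a harmful embedding until it crosses to the judged-safe side of the hyperplane:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
harmful = rng.normal(-1.0, 0.5, size=(100, d))
safe = rng.normal(+1.0, 0.5, size=(100, d))

# "Discriminator": a linear safety probe w·x + b (> 0 means judged safe),
# here set from class means rather than trained, for illustration only.
w = safe.mean(axis=0) - harmful.mean(axis=0)
b = -w @ (safe.mean(axis=0) + harmful.mean(axis=0)) / 2

def attack(x, eps):
    # "Generator" stand-in: the simplest perturbation, a scaled step along
    # w that pushes harmful embeddings across the safety boundary.
    return x + eps * w / np.linalg.norm(w)

before = np.mean(harmful @ w + b > 0)        # fraction judged safe, unperturbed
after = np.mean(attack(harmful, 10.0) @ w + b > 0)  # after the embedding shift
print(f"judged safe: {before:.2f} -> {after:.2f}")
```

Defense in this picture is the mirror image: retraining the boundary (or monitoring embedding shifts along it) so that perturbed harmful embeddings are detected, which is what the adversarial co-training between generator and boundary classifier provides.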