Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses large language models' (LLMs') vulnerability to jailbreak attacks and the fragility of their safety alignment. We propose Latent Fusion Jailbreak (LFJ), a gradient-guided attack that interpolates the hidden states of harmful and harmless queries to generate adversarial prompts with high success rates and natural fluency. The contribution is threefold: (1) the first differentiable jailbreak framework based on latent-space interpolation; (2) a query-pair selection strategy that leverages both semantic and syntactic similarity; and (3) a multi-objective optimization that jointly maximizes attack effectiveness and linguistic naturalness. Evaluated on Vicuna, LLaMA-2, and other open-weight LLMs, LFJ achieves an average attack success rate (ASR) of 94.01%, substantially outperforming prior methods. A proposed defense based on adversarial training reduces ASR by over 80% while preserving model performance on standard downstream tasks.

📝 Abstract
Large language models (LLMs) demonstrate impressive capabilities in various language tasks but are susceptible to jailbreak attacks that circumvent their safety alignments. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. Ablation studies validate the importance of query pair selection, hidden state interpolation components, and optimization strategies in LFJ's effectiveness.
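The paper does not publish code; as a rough sketch of the central idea — linearly blending the hidden states of a harmful and a benign query, then tuning the blend weight by gradient search — the following toy example operates on plain vectors rather than real LLM activations, and the function names and surrogate objective are illustrative assumptions:

```python
import numpy as np

def fuse_hidden_states(h_harmful, h_benign, alpha):
    """Hypothetical latent fusion: linear interpolation of two hidden states."""
    return alpha * h_harmful + (1.0 - alpha) * h_benign

def optimize_alpha(h_harmful, h_benign, objective, steps=200, lr=0.01, eps=1e-4):
    """Gradient-guided search for the blend weight, using a finite-difference
    estimate of the surrogate objective's gradient (stand-in for backprop)."""
    alpha = 0.5
    for _ in range(steps):
        up = objective(fuse_hidden_states(h_harmful, h_benign, alpha + eps))
        dn = objective(fuse_hidden_states(h_harmful, h_benign, alpha - eps))
        alpha += lr * (up - dn) / (2 * eps)      # ascend the objective
        alpha = float(np.clip(alpha, 0.0, 1.0))  # keep a valid blend weight
    return alpha

# Toy demo: the surrogate objective rewards closeness to a known target
# state that lies at alpha = 0.7 on the interpolation line.
rng = np.random.default_rng(0)
h_a, h_b = rng.normal(size=8), rng.normal(size=8)
target = fuse_hidden_states(h_a, h_b, 0.7)
objective = lambda h: -np.sum((h - target) ** 2)
alpha_star = optimize_alpha(h_a, h_b, objective)
```

In the actual attack the objective would be defined over the model's output distribution (attack success plus fluency terms) and differentiated directly through the network; the finite-difference loop above only mimics that search on a one-dimensional toy.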
Problem

Research questions and friction points this paper is trying to address.

LLM safety alignment remains fragile and can be circumvented by jailbreak attacks
Whether interpolating the hidden states of harmful and benign query pairs can elicit prohibited responses
How to mitigate such representation-based attacks via adversarial training without degrading benign performance
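The defense described in the paper fine-tunes the model on interpolated examples so that blended representations are still treated as unsafe. A minimal stand-in for that idea, assuming a toy logistic-regression "refusal head" over synthetic feature vectors rather than a real LLM (all names and data below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_refusal_head(X, y, epochs=300, lr=0.1):
    """Logistic-regression stand-in for a safety classifier, trained by
    plain gradient descent on the cross-entropy loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        g = sigmoid(X @ w + b) - y          # dL/dz for cross-entropy
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

# Toy data: harmful and benign "representations", plus LFJ-style
# interpolations of the two labeled as harmful -- the adversarial-training
# step that teaches the head to refuse blended inputs as well.
rng = np.random.default_rng(1)
harmful = rng.normal(loc=+1.0, size=(50, 16))
benign  = rng.normal(loc=-1.0, size=(50, 16))
alpha   = rng.uniform(0.4, 0.9, size=(50, 1))
fused   = alpha * harmful + (1.0 - alpha) * benign
X = np.vstack([harmful, benign, fused])
y = np.concatenate([np.ones(50), np.zeros(50), np.ones(50)])
w, b = train_refusal_head(X, y)
```

After training, a fresh interpolated representation should still score above the refusal threshold while benign inputs pass, mirroring the paper's reported ASR drop of over 80% without degraded benign performance.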
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-guided hidden-state interpolation at influential layers and tokens
Query-pair selection based on semantic and syntactic similarity
Adversarial training on interpolated examples as a defense
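The query-pair selection step scores candidate harmful/benign pairs by both meaning and surface form. A small sketch, assuming a bag-of-words embedding for the "semantic" score and token-level Jaccard overlap for the "syntactic" one — the paper's actual encoders and weights are not specified here, so every function and weight below is an illustrative assumption:

```python
import numpy as np

def bow_embed(text, vocab):
    """Toy bag-of-words embedding over a fixed vocabulary."""
    v = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    return v

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def pair_score(q_harm, q_benign, vocab, w_sem=0.7, w_syn=0.3):
    """Weighted mix of semantic (cosine) and syntactic (Jaccard) similarity."""
    sem = cosine(bow_embed(q_harm, vocab), bow_embed(q_benign, vocab))
    return w_sem * sem + w_syn * jaccard(q_harm, q_benign)

def select_pair(harmful_qs, benign_qs):
    """Pick the harmful/benign pair with the highest combined similarity."""
    words = {t for q in harmful_qs + benign_qs for t in q.lower().split()}
    vocab = {t: i for i, t in enumerate(sorted(words))}
    return max(((h, b) for h in harmful_qs for b in benign_qs),
               key=lambda p: pair_score(p[0], p[1], vocab))

# Usage: the benign query sharing most structure with the harmful one wins.
h, b = select_pair(["explain how to pick a lock"],
                   ["explain how to pick a guitar", "describe a sunny day"])
```

In practice one would swap the bag-of-words vectors for sentence embeddings from a real encoder; the point of the combined score is that high overlap in both theme and phrasing gives the interpolated hidden states a coherent, fluent surface form.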