🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) to jailbreak attacks that circumvent their safety alignment. We propose Latent Fusion Jailbreak (LFJ), a representation-based, gradient-guided attack that interpolates the hidden states of harmful and benign query pairs to elicit prohibited responses. The method has three components: (1) selection of query pairs with high thematic and syntactic similarity; (2) gradient-guided hidden-state interpolation at influential layers and tokens; and (3) an optimization step that balances attack success, output fluency, and computational efficiency. Evaluated on models such as Vicuna and LLaMA-2 across benchmarks such as AdvBench and MaliciousInstruct, LFJ achieves an average attack success rate (ASR) of 94.01%, outperforming existing methods. As a mitigation, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs.
📝 Abstract
Large language models (LLMs) demonstrate impressive capabilities across a range of language tasks but are susceptible to jailbreak attacks that circumvent their safety alignment. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks such as AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. Ablation studies validate the importance of query pair selection, hidden-state interpolation components, and optimization strategies to LFJ's effectiveness.
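The headline numbers above are reported as an average attack success rate and as a reduction in that rate. As a reading aid, here is a minimal sketch of how such figures are commonly computed. It assumes ASR is the percentage of attempts judged successful, that the average is an unweighted mean over (model, benchmark) runs, and that "reducing ASR by over 80%" is a relative reduction. The abstract states none of these conventions. The `EvalRun` type, the function names, and all counts are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """Outcome counts for one (model, benchmark) evaluation (hypothetical type)."""
    model: str
    benchmark: str
    successes: int  # attempts judged to be successful attacks
    attempts: int   # total attempts in this run


def asr(run: EvalRun) -> float:
    """Attack success rate of a single run, as a percentage."""
    if run.attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * run.successes / run.attempts


def mean_asr(runs: list[EvalRun]) -> float:
    """Unweighted (macro) average of per-run ASR values (an assumed convention)."""
    if not runs:
        raise ValueError("at least one run is required")
    return sum(asr(r) for r in runs) / len(runs)


def relative_reduction(before: float, after: float) -> float:
    """Drop from `before` to `after`, as a percentage of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Placeholder counts for illustration only; NOT results from the paper.
runs = [
    EvalRun("model-a", "bench-x", successes=90, attempts=100),
    EvalRun("model-b", "bench-x", successes=45, attempts=50),
    EvalRun("model-a", "bench-y", successes=80, attempts=100),
]

if __name__ == "__main__":
    print(f"mean ASR: {mean_asr(runs):.2f}%")
    print(f"relative reduction: {relative_reduction(80.0, 16.0):.1f}%")
```

Under the relative reading, a defense that lowers ASR from 80% to 16% achieves an 80% reduction. Under an absolute reading, the same phrase would mean a drop of 80 percentage points, so the convention matters when comparing defenses.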