Adversarial Attack on Large Language Models using Exponentiated Gradient Descent

📅 2025-05-14
🤖 AI Summary
Large language models (LLMs) are vulnerable to jailbreaking attacks, yet existing discrete search methods suffer from low efficiency, while continuous embedding optimization approaches are hindered by projection-induced distribution distortion. Method: We propose an intrinsic continuous optimization framework that directly optimizes token distributions—rather than token embeddings—under the probability simplex constraint. To our knowledge, this is the first work to integrate exponentiated gradient descent with Bregman projection for LLM jailbreaking, with theoretical convergence guarantees. Our approach circumvents both combinatorial explosion in discrete search and distributional distortion arising from post-optimization projection. Results: Extensive experiments across five open-source LLMs and four public benchmark datasets demonstrate that our method achieves significantly higher jailbreaking success rates than three state-of-the-art baselines, while substantially improving computational efficiency.
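To illustrate the core update, here is a minimal sketch (not the authors' implementation) of an exponentiated gradient step over a token distribution: a multiplicative update followed by renormalization, which is exactly the Bregman (KL) projection onto the probability simplex. The toy loss, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

def exponentiated_gradient_step(p, grad, eta=0.5):
    """One EGD step on the probability simplex.

    The multiplicative update keeps every coordinate positive, and the
    renormalization is the Bregman (KL) projection back onto the simplex,
    so the iterate always remains a valid probability distribution.
    """
    p_new = p * np.exp(-eta * grad)   # multiplicative (exponentiated) update
    return p_new / p_new.sum()        # KL/Bregman projection onto the simplex

# Toy usage (illustrative): minimize a linear loss <c, p> over the simplex.
c = np.array([3.0, 1.0, 2.0])         # gradient of the loss is the constant c
p = np.full(3, 1.0 / 3.0)             # start at the uniform distribution
for _ in range(200):
    p = exponentiated_gradient_step(p, c)
# Probability mass concentrates on the coordinate with the smallest loss.
```

In the paper's setting the distribution would range over the model's vocabulary for each adversarial token position, and the gradient would come from the jailbreak objective; the point of the simplex constraint is that no post-hoc projection from embedding space to discrete tokens is needed.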

📝 Abstract
As Large Language Models (LLMs) are widely used, understanding them systematically is key to improving their safety and realizing their full potential. Although many models are aligned using techniques such as reinforcement learning from human feedback (RLHF), they are still vulnerable to jailbreaking attacks. Some of the existing adversarial attack methods search for discrete tokens that may jailbreak a target model while others try to optimize the continuous space represented by the tokens of the model's vocabulary. While techniques based on the discrete space may prove to be inefficient, optimization of continuous token embeddings requires projections to produce discrete tokens, which might render them ineffective. To fully utilize the constraints and the structures of the space, we develop an intrinsic optimization technique using exponentiated gradient descent with the Bregman projection method to ensure that the optimized one-hot encoding always stays within the probability simplex. We prove the convergence of the technique and implement an efficient algorithm that is effective in jailbreaking several widely used LLMs. We demonstrate the efficacy of the proposed technique using five open-source LLMs on four openly available datasets. The results show that the technique achieves a higher success rate with great efficiency compared to three other state-of-the-art jailbreaking techniques. The source code for our implementation is available at: https://github.com/sbamit/Exponentiated-Gradient-Descent-LLM-Attack
Problem

Research questions and friction points this paper is trying to address.

Adversarial attacks exploit LLM vulnerabilities despite alignment techniques such as RLHF
Discrete token search is computationally inefficient; continuous embedding optimization loses effectiveness when projected back to discrete tokens
How to jailbreak LLMs both effectively and efficiently by optimizing directly over token distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exponentiated gradient descent for adversarial attacks
Bregman projection keeps iterates on the probability simplex
Efficient jailbreaking via intrinsic optimization technique