Attacking Large Language Models with Projected Gradient Descent

πŸ“… 2024-02-14
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 38
✨ Influential: 2
πŸ“„ PDF
πŸ€– AI Summary
Existing LLM alignment methods are vulnerable to adversarial prompt attacks, but mainstream discrete optimization-based attacks require more than 100,000 model queries, imposing prohibitive computational overhead and hindering quantitative evaluation and adversarial training. To address this, the authors revisit continuous attacks on LLMs based on projected gradient descent (PGD) over a continuously relaxed input prompt. Whereas earlier ordinary gradient-based attacks largely failed, the key ingredient here is carefully controlling the error introduced by the continuous relaxation via projection steps. Empirically, the method matches the attack success of state-of-the-art discrete optimization while being up to one order of magnitude faster, making quantitative robustness analyses and adversarial training substantially more practical.

πŸ“ Abstract
Current LLM alignment methods are readily broken through specifically crafted adversarial prompts. While crafting adversarial prompts using discrete optimization is highly effective, such attacks typically use more than 100,000 LLM calls. This high computational cost makes them unsuitable for, e.g., quantitative analyses and adversarial training. To remedy this, we revisit Projected Gradient Descent (PGD) on the continuously relaxed input prompt. Although previous attempts with ordinary gradient-based attacks largely failed, we show that carefully controlling the error introduced by the continuous relaxation tremendously boosts their efficacy. Our PGD for LLMs is up to one order of magnitude faster than state-of-the-art discrete optimization to achieve the same devastating attack results.
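The core idea, restated concretely: relax each discrete token to a point on the probability simplex over the vocabulary, take gradient steps on that relaxed prompt, and project each row back onto the simplex after every step. Below is a minimal sketch in NumPy with a toy gradient standing in for a real LLM loss; the function names, the toy objective, and the sorting-based simplex projection are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def project_onto_simplex(v):
    # Euclidean projection of a vector onto the probability simplex,
    # using the standard sorting-based algorithm.
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def pgd_attack(loss_grad, X, step_size=0.1, num_steps=100):
    # X: (seq_len, vocab) relaxed one-hot prompt. After each gradient
    # step, every row is projected back onto the probability simplex.
    for _ in range(num_steps):
        X = X - step_size * loss_grad(X)
        X = np.apply_along_axis(project_onto_simplex, 1, X)
    return X

# Toy adversarial objective: push every position toward token index 2.
def toy_grad(X):
    G = np.zeros_like(X)
    G[:, 2] = -1.0
    return G

X0 = np.full((3, 4), 0.25)                    # uniform relaxed prompt
X = pgd_attack(toy_grad, X0)                  # converges to one-hot rows
```

Against a real model, `loss_grad` would backpropagate the adversarial objective through the embedding matrix; the paper additionally controls how far the relaxed prompt may drift from a discrete one.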
Problem

Research questions and friction points this paper is trying to address.

Adversarial prompts readily break current LLM alignment methods.
Discrete optimization attacks require over 100,000 LLM calls, too costly for quantitative analyses or adversarial training.
Previous ordinary gradient-based attacks on LLMs largely failed.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies Projected Gradient Descent to a continuously relaxed input prompt
Carefully controls the error introduced by the continuous relaxation
Up to one order of magnitude faster than state-of-the-art discrete optimization at the same attack success
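Since the attack optimizes a relaxed prompt but must ultimately emit discrete tokens, the "relaxation error" is the gap between the loss at the relaxed solution and at its discretized counterpart. A hypothetical way to measure that gap, with function names and the toy loss being assumptions for illustration rather than the paper's code:

```python
import numpy as np

def discretize(X):
    # Snap each relaxed token distribution to its nearest one-hot vector
    # (argmax per row; ties resolve to the lowest index).
    Z = np.zeros_like(X)
    Z[np.arange(X.shape[0]), X.argmax(axis=1)] = 1.0
    return Z

def relaxation_gap(loss, X):
    # Error introduced by the continuous relaxation: the difference
    # between the loss at the relaxed prompt and at its discretization.
    return abs(loss(discretize(X)) - loss(X))
```

Keeping this gap small during optimization is what lets the continuous attack transfer to actual discrete prompts.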
Simon Geisler
Google Research
Machine Learning · Deep Learning on Graphs · Adversarial Robustness · Uncertainty Estimation
Tom Wollschläger
Department of Computer Science, Technical University of Munich
M. H. I. Abdalla
Department of Computer Science, Technical University of Munich
Johannes Gasteiger
Google Research
Stephan Günnemann
Department of Computer Science, Technical University of Munich