Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work challenges the purported robustness of current alignment defenses for large language models (LLMs) under strong adversarial settings, specifically against "informed adversaries": white-box attackers with access to intermediate alignment checkpoints. To this end, the paper proposes a gradient-aware attack paradigm that, for the first time, leverages gradient information from alignment checkpoints to initialize the Greedy Coordinate Gradient (GCG) algorithm, together with a checkpoint selection strategy that efficiently generates universal (input-agnostic) adversarial suffixes. Evaluated on state-of-the-art aligned models, including Llama-3-Chat and Qwen2-Chat, the method substantially increases attack success rates, bypassing existing alignment safeguards and exposing systemic vulnerabilities. These results undermine the optimistic assessment that such defenses achieve near-zero attack success rates, revealing fundamental weaknesses in current alignment robustness guarantees.

📝 Abstract
Large language models (LLMs) are rapidly deployed in real-world applications ranging from chatbots to agentic systems. Alignment is one of the main approaches used to defend against attacks such as prompt injection and jailbreaks. Recent defenses report near-zero Attack Success Rates (ASR) even against Greedy Coordinate Gradient (GCG), a white-box attack that generates adversarial suffixes to induce attacker-desired outputs. However, the search space over discrete tokens is extremely large, making the task of finding successful attacks difficult. GCG has, for instance, been shown to converge to local minima, making it sensitive to initialization choices. In this paper, we assess the future-proof robustness of these defenses using a more informed threat model: attackers who have access to some information about the alignment process. Specifically, we propose an informed white-box attack leveraging intermediate model checkpoints to initialize GCG, with each checkpoint acting as a stepping stone for the next one. We show this approach to be highly effective across state-of-the-art (SOTA) defenses and models. We further show our informed initialization to outperform other initialization methods, and show a gradient-informed checkpoint selection strategy to greatly improve attack performance and efficiency. Importantly, we also show our method to successfully find universal adversarial suffixes -- single suffixes effective across diverse inputs. Our results show that, contrary to previous beliefs, effective adversarial suffixes do exist against SOTA alignment-based defenses, that these can be found by existing attack methods when adversaries exploit alignment knowledge, and that even universal suffixes exist. Taken together, our results highlight the brittleness of current alignment-based methods and the need to consider stronger threat models when testing the safety of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM defenses against informed white-box attacks
Assessing robustness of alignment-based defenses with adversarial knowledge
Finding universal adversarial suffixes effective across diverse inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages intermediate model checkpoints for GCG initialization
Uses gradient-informed checkpoint selection strategy
Finds universal adversarial suffixes across diverse inputs
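The stepping-stone idea above can be sketched in miniature. The toy Python below treats each alignment checkpoint as a simple scoring function rather than a real model, and random token swaps stand in for GCG's gradient-guided candidate substitutions; the names `attack_loss`, `gcg_step`, and `chained_attack` are illustrative and not from the paper's implementation. The point is the control flow: the suffix optimized against one checkpoint initializes the search on the next, rather than restarting from scratch.

```python
import random

# Toy stand-in: a "checkpoint" is a dict holding a target token sequence,
# and the loss is squared distance to that target. In the real attack this
# would be a forward/backward pass through an actual model checkpoint.
def attack_loss(checkpoint, suffix):
    return sum((t - checkpoint["target"][i]) ** 2 for i, t in enumerate(suffix))

def gcg_step(checkpoint, suffix, vocab, rng):
    """One greedy coordinate step: at each position, try a small pool of
    candidate tokens and keep any swap that strictly reduces the loss.
    (Real GCG ranks candidates by gradient; random sampling stands in here.)"""
    best = list(suffix)
    for pos in range(len(best)):
        for tok in rng.sample(vocab, k=4):  # toy candidate pool
            cand = list(best)
            cand[pos] = tok
            if attack_loss(checkpoint, cand) < attack_loss(checkpoint, best):
                best = cand
    return best

def chained_attack(checkpoints, init_suffix, vocab, steps=5, seed=0):
    """Attack checkpoints in training order, using the suffix found on each
    checkpoint to initialize the search on the next -- the paper's
    stepping-stone initialization, heavily simplified. The paper's
    gradient-informed checkpoint *selection* is omitted here."""
    rng = random.Random(seed)
    suffix = list(init_suffix)
    for ckpt in checkpoints:
        for _ in range(steps):
            suffix = gcg_step(ckpt, suffix, vocab, rng)
    return suffix
```

Because intermediate checkpoints are "easier" targets than the fully aligned model, each hop starts the search near a good region of the discrete token space, which is exactly why initialization matters for a local-search method like GCG.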