Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models

📅 2025-10-24

🤖 AI Summary

Problem: Existing work lacks a rigorous, mechanistic explanation of why adversarial suffixes transfer across prompts and large language models (LLMs). Method: The paper analyzes suffix transferability by combining discrete optimization-based attacks, hidden-layer representation analysis, and targeted intervention experiments to quantify how prompts and suffixes interact in the model's internal representation space. Contribution/Results: It identifies three statistical properties that strongly correlate with transfer success: how strongly a prompt without a suffix activates the model's refusal direction, how strongly a suffix pushes away from that direction, and how large the resulting shifts are in directions orthogonal to refusal. Prompt semantic similarity, by contrast, correlates only weakly with transfer success. These findings clarify when and why transfer occurs, and the authors use them in interventional experiments to show practical improvements in attack success.

📝 Abstract

Discrete optimization-based jailbreaking attacks on large language models aim to generate short, nonsensical suffixes that, when appended onto input prompts, elicit disallowed content. Notably, these suffixes are often transferable -- succeeding on prompts and models for which they were never optimized. And yet, despite the fact that transferability is surprising and empirically well-established, the field lacks a rigorous analysis of when and why transfer occurs. To fill this gap, we identify three statistical properties that strongly correlate with transfer success across numerous experimental settings: (1) how much a prompt without a suffix activates a model's internal refusal direction, (2) how strongly a suffix induces a push away from this direction, and (3) how large these shifts are in directions orthogonal to refusal. On the other hand, we find that prompt semantic similarity only weakly correlates with transfer success. These findings lead to a more fine-grained understanding of transferability, which we use in interventional experiments to showcase how our statistical analysis can translate into practical improvements in attack success.
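The three properties named in the abstract can be illustrated with simple vector arithmetic on hidden states. The following is a minimal sketch, not the authors' implementation: it assumes a single hidden-state vector per input (for example, taken at one layer and token position) and a refusal direction that is already known. The function name `transfer_features`, the argument names, and the toy vectors are all hypothetical.

```python
import numpy as np

def transfer_features(h_prompt, h_with_suffix, refusal_dir):
    """Compute three illustrative quantities from hidden-state vectors.

    h_prompt:       hidden state for the prompt without a suffix (assumed given)
    h_with_suffix:  hidden state for the same prompt with a suffix appended
    refusal_dir:    a vector pointing along the model's refusal direction
    """
    r = refusal_dir / np.linalg.norm(refusal_dir)  # unit refusal direction
    delta = h_with_suffix - h_prompt               # shift caused by the suffix

    # (1) How much the bare prompt activates the refusal direction.
    refusal_activation = float(h_prompt @ r)
    # (2) How far the suffix pushes away from refusal (positive = away).
    suffix_push = float(-(delta @ r))
    # (3) Size of the shift in directions orthogonal to refusal.
    orthogonal_shift = float(np.linalg.norm(delta - (delta @ r) * r))

    return {
        "refusal_activation": refusal_activation,
        "suffix_push": suffix_push,
        "orthogonal_shift": orthogonal_shift,
    }

# Toy 3-dimensional vectors, chosen only to make the arithmetic easy to follow.
f = transfer_features(
    h_prompt=np.array([2.0, 0.0, 0.0]),
    h_with_suffix=np.array([0.5, 3.0, 4.0]),
    refusal_dir=np.array([4.0, 0.0, 0.0]),
)
print(f)
```

The sign convention for `suffix_push` is a choice made here so that a larger value means a stronger move away from refusal; the paper may define the quantity differently.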

Problem

Research questions and friction points this paper is trying to address.

Analyzes transferability of adversarial suffixes across prompts and large language models

Identifies statistical properties correlating with successful attack transfer

Explains when and why adversarial suffixes work on unseen prompts/models

Innovation

Methods, ideas, or system contributions that make the work stand out.

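Three statistical properties strongly correlate with transfer success, while prompt semantic similarity correlates only weakly

How strongly a bare prompt activates the refusal direction, and how strongly a suffix pushes away from it, both relate to transfer success

The size of suffix-induced shifts in directions orthogonal to refusal also correlates with transferability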
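One simple way to check whether a property "correlates with transfer success" is a point-biserial correlation, which is the Pearson correlation between a continuous feature and a binary outcome. The sketch below is an assumption about how such a check could look, not the statistical procedure used in the paper. The helper name `point_biserial` is hypothetical, and the numbers are made-up toy values, not results.

```python
import numpy as np

def point_biserial(feature, success):
    """Pearson correlation between a continuous feature and a 0/1 outcome."""
    feature = np.asarray(feature, dtype=float)
    success = np.asarray(success, dtype=float)
    return float(np.corrcoef(feature, success)[0, 1])

# Made-up toy values for illustration only (not data from the paper):
# one suffix-push score per (prompt, suffix) pair, and whether the attack transferred.
suffix_push = [0.1, 0.4, 0.9, 1.3, 1.8, 2.2]
transferred = [0, 0, 0, 1, 1, 1]

r = point_biserial(suffix_push, transferred)
print(f"point-biserial correlation: {r:.3f}")
```

A value near 1 or -1 indicates a strong linear association and a value near 0 a weak one. The same helper could be applied to a semantic-similarity score to compare how strongly it tracks transfer success.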
