🤖 AI Summary
Existing robustness evaluations for large language models (LLMs) overlook tail risks in output distributions and therefore fail to reflect real-world deployment safety. Method: We propose the first framework that formulates adversarial attacks as an "optimization-sampling" resource allocation problem. It introduces a novel data-free entropy-maximization objective and integrates output distribution modeling, sampling-based augmentation, and dynamic allocation of compute, while remaining compatible with mainstream attack pipelines. Contribution/Results: Experiments demonstrate that our approach improves attack success rates by up to 48% while reducing inference overhead by up to two orders of magnitude. Crucially, it enhances detection of rare but harmful LLM behaviors, particularly in the tail of the output distribution, without compromising interpretability. This yields a more reliable and efficient quantitative tool for safety assessment under large-scale deployment conditions.
📝 Abstract
To guarantee safe and robust deployment of large language models (LLMs) at scale, it is critical to accurately assess their adversarial robustness. Existing adversarial attacks typically target harmful responses in single-point, greedy generations, overlooking the inherently stochastic nature of LLMs. In this paper, we propose a novel framework for adversarial robustness evaluation that explicitly models the entire output distribution, including tail risks, providing better estimates of model robustness at scale. By casting the attack process as a resource allocation problem between optimization and sampling, we determine compute-optimal tradeoffs and show that integrating sampling into existing attacks boosts attack success rate (ASR) by up to 48% and improves efficiency by up to two orders of magnitude. Our framework also enables us to analyze how different attack algorithms affect output harm distributions. Surprisingly, we find that most optimization strategies have little effect on output harmfulness. Finally, we introduce a data-free proof-of-concept objective based on entropy maximization to demonstrate how our tail-aware perspective enables new optimization targets. Overall, our findings highlight the importance of tail-aware attacks and evaluation protocols to accurately assess and strengthen LLM safety.
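The core contrast the abstract draws, single-point greedy evaluation versus distribution-aware sampling, can be illustrated with a toy sketch. The code below is not from the paper: `p_harm` is a hypothetical per-prompt probability that a stochastic generation is harmful, and the two functions stand in for greedy decoding and Monte Carlo sampling of the output distribution.

```python
import random

def greedy_eval(p_harm: float) -> bool:
    # Greedy decoding is deterministic: it observes only the modal output.
    # In this toy model, harm is seen only if it dominates the distribution,
    # so low-probability (tail) harm is never detected.
    return p_harm > 0.5

def sampled_asr(p_harm: float, n: int, seed: int = 0) -> float:
    # Monte Carlo estimate of P(harmful output) under stochastic sampling:
    # draw n generations and count how many land in the harmful region.
    rng = random.Random(seed)
    hits = sum(rng.random() < p_harm for _ in range(n))
    return hits / n

# A prompt whose completions are harmful only 5% of the time (tail risk):
p = 0.05
print(greedy_eval(p))        # → False (greedy misses the tail entirely)
print(sampled_asr(p, 1000))  # sampling surfaces a nonzero harm rate near 0.05
```

The "optimization-sampling" resource allocation question then becomes: given a fixed compute budget, how many queries to spend refining the adversarial prompt versus drawing samples like these to estimate (and exploit) the tail of the output distribution.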