A methodological analysis of prompt perturbations and their effect on attack success rates

📅 2025-11-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the differential robustness of large language models (LLMs) aligned via supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning from human feedback (RLHF) under prompt-based adversarial attacks, focusing on the sensitivity of attack success rate (ASR) to subtle prompt perturbations. Method: We propose a systematic evaluation framework grounded in statistical hypothesis testing, overcoming limitations of conventional attack benchmarks by quantifying the variability and significance of ASR shifts. Contribution/Results: Experiments reveal that minute prompt modifications, e.g., word reordering or semantically equivalent substitutions, induce substantial ASR fluctuations (±30% or more), with marked heterogeneity across alignment methods: DPO models exhibit heightened sensitivity to syntactic perturbations, whereas RLHF models degrade more readily under semantically equivalent transformations. To our knowledge, this is the first study to quantitatively establish a strong coupling between alignment methodology and prompt robustness, providing both theoretical insight and empirical grounding for developing perturbation-resilient, trustworthy alignment evaluation protocols.

📝 Abstract
This work investigates how different alignment methods for Large Language Models (LLMs) affect the models' responses to prompt attacks. We selected open-source models based on the most common alignment methods, namely Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). We conducted a systematic analysis using statistical methods to verify how sensitive the Attack Success Rate (ASR) is when we apply variations to prompts designed to elicit inappropriate content from LLMs. Our results show that even small prompt modifications can significantly change the ASR according to the statistical tests we ran, making the models more or less susceptible to particular types of attack. Critically, our results demonstrate that running existing "attack benchmarks" alone may not be sufficient to surface all possible vulnerabilities of both models and alignment methods. This paper thus contributes to ongoing efforts on model attack evaluation by means of systematic, statistically grounded analyses of the different alignment methods and of how sensitive their ASR is to prompt variation.
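The core statistical question the abstract raises is whether an ASR shift under prompt perturbation is significant rather than noise. A minimal sketch of one way to test this, using a standard two-proportion z-test on hypothetical attack counts (the counts, function names, and test choice below are illustrative assumptions, not the paper's actual protocol or data):

```python
import math

def asr(successes, trials):
    """Attack success rate: fraction of adversarial prompts that succeeded."""
    return successes / trials

def two_proportion_z(s1, n1, s2, n2):
    """Two-sided two-proportion z-test comparing ASR on an original prompt
    set (s1/n1) against a perturbed variant (s2/n2)."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)                       # pooled proportion
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 120/200 attacks succeed on the original prompts,
# 60/200 on paraphrased variants -- a 30-point ASR swing.
z, p = two_proportion_z(120, 200, 60, 200)
print(f"ASR original={asr(120, 200):.2f}, perturbed={asr(60, 200):.2f}, "
      f"z={z:.2f}, p={p:.2e}")
```

With counts this size, the 30-point swing is highly significant; with small prompt sets the same swing can fail to reach significance, which is one reason the paper argues for explicit statistical testing over raw benchmark ASR numbers.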
Problem

Research questions and friction points this paper is trying to address.

Analyzing how prompt perturbations affect attack success rates in LLMs
Evaluating vulnerability differences across SFT, DPO and RLHF alignment methods
Investigating limitations of existing attack benchmarks through statistical analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic analysis of prompt perturbation effects
Statistical verification of attack success rate sensitivity
Comparative evaluation of alignment method vulnerabilities