Semantic Representation Attack against Aligned Large Language Models

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses semantic-level adversarial attacks against aligned large language models (LLMs), proposing a novel jailbreaking paradigm grounded in the semantic representation space. Unlike conventional prompt-based attacks that rely on fixed trigger phrases, the proposed approach perturbs model representations at the semantic level by constructing a diverse response space encoding semantically equivalent harmful intent, enabling natural, stealthy, and efficient adversarial prompt generation. Key contributions include: (1) abandoning reliance on specific textual patterns in favor of modeling the distribution of harmful intent via semantic representation learning; and (2) a semantic-guided heuristic search algorithm that improves convergence and generalization while preserving interpretability. Evaluations across 18 state-of-the-art aligned LLMs yield an average attack success rate of 89.41%, with 11 models reaching 100%, substantially outperforming existing methods in effectiveness, stealthiness, and computational efficiency.

📝 Abstract
Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses, such as "Sure, here is...", suffering from limited convergence, unnatural prompts, and high computational costs. We introduce Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space comprising diverse responses with equivalent harmful meanings. This innovation resolves the inherent trade-off between attack efficacy and prompt naturalness that plagues existing methods. The Semantic Representation Heuristic Search algorithm is proposed to efficiently generate semantically coherent and concise adversarial prompts by maintaining interpretability during incremental expansion. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that our method achieves unprecedented attack success rates (89.41% averaged across 18 LLMs, including 100% on 11 models) while maintaining stealthiness and efficiency. Comprehensive experimental results confirm the overall superiority of our Semantic Representation Attack. The code will be publicly available.
Problem

Research questions and friction points this paper is trying to address.

Circumventing alignment safeguards to generate harmful content from LLMs
Overcoming limitations of unnatural prompts and high computational costs
Resolving trade-off between attack efficacy and prompt naturalness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Targets semantic representation space for diverse harmful responses
Uses heuristic search algorithm for interpretable prompt generation
Achieves high attack success rates while maintaining stealthiness
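The shift from an exact-string objective to a semantic one can be illustrated with a toy sketch. Everything below is hypothetical and greatly simplified: the paper operates in an LLM's internal representation space, whereas this sketch substitutes a bag-of-words embedding and cosine similarity purely to show why scoring a response against a *set* of semantically equivalent targets accepts paraphrases that a fixed affirmative prefix would reject.

```python
import math
from collections import Counter

def embed(text):
    # toy bag-of-words "embedding"; a stand-in for an LLM's hidden representation
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def exact_match_objective(response, target="Sure, here is"):
    # conventional attacks succeed only when one fixed affirmative prefix appears
    return response.startswith(target)

def semantic_objective(response, target_set):
    # semantic-representation view: score against a space of semantically
    # equivalent targets and keep the best match
    r = embed(response)
    return max(cosine(r, embed(t)) for t in target_set)

# hypothetical set of semantically equivalent affirmative responses
targets = [
    "Sure, here is the information you asked for",
    "Of course, here is what you requested",
    "Certainly, the details you asked for are below",
]

paraphrase = "Of course, below is the information you requested"
print(exact_match_objective(paraphrase))   # → False: the fixed prefix misses it
print(semantic_objective(paraphrase, targets) > 0.5)  # the semantic score accepts it
```

A search procedure built on the second objective can reward partial semantic progress at every expansion step instead of waiting for one exact string, which is the intuition behind the improved convergence the summary describes.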
Authors

Jiawei Lian (3D vision; weakly/self-supervised learning)
Jianhong Pan, School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China
Lefan Wang, School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China
Yi Wang, Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University
Shaohui Mei, School of Electronics and Information, Northwestern Polytechnical University (remote sensing; pattern recognition; image processing)
Lap-Pui Chau, The Hong Kong Polytechnic University (visual signal processing)