Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models

📅 2025-09-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the emerging threat of “stealth jailbreaking”—adversarial attacks that embed malicious intent within semantically obfuscated, seemingly benign prompts to evade existing safety mechanisms. We systematically investigate its construction principles and security implications. To this end, we introduce the first dedicated benchmark dataset for stealth jailbreaking, comprising 500 high-quality, human-curated samples, and propose a seven-dimensional evaluation framework—covering harmfulness, compliance, technical feasibility, safety awareness, and other critical dimensions—integrating expert annotation with multi-metric quantitative scoring. Experiments reveal that while mainstream large language models exhibit robust safety behavior under standard inputs, their safety response rates drop significantly under stealth jailbreaking prompts, exposing acute vulnerability to contextual ambiguity. Our findings expose the fundamental limitations of current keyword- or rule-based defense paradigms and provide both a foundational dataset and a rigorous evaluation standard to guide the development of fine-grained, semantics-aware, adaptive safety mechanisms.

Technology Category

Application Category

📝 Abstract
Large Language Models (LLMs) are increasingly vulnerable to a sophisticated form of adversarial prompting known as camouflaged jailbreaking. This method embeds malicious intent within seemingly benign language to evade existing safety mechanisms. Unlike overt attacks, these subtle prompts exploit contextual ambiguity and the flexible nature of language, posing significant challenges to current defense systems. This paper investigates the construction and impact of camouflaged jailbreak prompts, emphasizing their deceptive characteristics and the limitations of traditional keyword-based detection methods. We introduce a novel benchmark dataset, Camouflaged Jailbreak Prompts, containing 500 curated examples (400 harmful and 100 benign prompts) designed to rigorously stress-test LLM safety protocols. In addition, we propose a multi-faceted evaluation framework that measures harmfulness across seven dimensions: Safety Awareness, Technical Feasibility, Implementation Safeguards, Harmful Potential, Educational Value, Content Quality, and Compliance Score. Our findings reveal a stark contrast in LLM behavior: while models demonstrate high safety and content quality with benign inputs, they exhibit a significant decline in performance and safety when confronted with camouflaged jailbreak attempts. This disparity underscores a pervasive vulnerability, highlighting the urgent need for more nuanced and adaptive security strategies to ensure the responsible and robust deployment of LLMs in real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Detecting camouflaged jailbreak prompts in LLMs
Evaluating vulnerabilities to adversarial hidden malicious intent
Assessing limitations of current safety mechanisms against subtle attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created benchmark dataset with 500 curated camouflaged jailbreak prompts
Proposed multi-faceted evaluation framework across seven harm dimensions
Designed method to stress-test LLM safety against deceptive adversarial prompts
🔎 Similar Papers
Y
Youjia Zheng
Stevens Institute of Technology
M
Mohammad Zandsalimy
University of British Columbia
Shanu Sushmita
Shanu Sushmita
Northeastern University
Information Retrieval and Machine Learning