Say It Differently: Linguistic Styles as Jailbreak Vectors

📅 2025-11-13
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work identifies linguistic style (e.g., fear, curiosity) as a novel jailbreaking attack surface: adversarial inputs rephrase harmful intent using stylistic variation to evade safety-alignment mechanisms in large language models (LLMs). Method: We introduce the first systematic jailbreaking benchmark covering 11 distinct styles, constructed via human-crafted templates and LLM-based style rewriting. We further propose a two-stage, LLM-based style-neutralization preprocessing method that automatically strips stylistic markers and restores semantic neutrality. Contribution/Results: Experiments across 16 mainstream LLMs show that stylistic reframing boosts jailbreak success rates by up to 57 percentage points. Our defense generalizes across models and remains effective regardless of model scale, revealing persistent stylistic threats even in larger models. This is the first systematic study to establish linguistic style as an adversarial dimension orthogonal to semantic paraphrasing, and it provides a scalable, style-agnostic defense paradigm.
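The two-stage style-neutralization defense is described only at a high level here. A minimal sketch of that control flow, assuming a generic `call_llm` text-completion callable and invented detection/rewrite prompts (the paper's actual prompts are not given), might look like:

```python
# Sketch of a two-stage style-neutralization preprocessor (assumed design):
# stage 1 asks a secondary LLM whether the input uses stylistic or emotional
# framing; stage 2 rewrites it into a neutral register before it is passed
# to the target model. Prompts and the `call_llm` signature are illustrative.

STYLE_DETECT_PROMPT = (
    "Does the following user message use an emotional or stylistic framing "
    "(e.g., fear, curiosity, compassion)? Answer YES or NO.\n\n{msg}"
)
NEUTRALIZE_PROMPT = (
    "Rewrite the following message in a plain, neutral style. Preserve the "
    "literal request but remove any emotional or stylistic framing.\n\n{msg}"
)

def neutralize_style(message: str, call_llm) -> str:
    """Return a style-neutral version of `message`; identity if none detected."""
    verdict = call_llm(STYLE_DETECT_PROMPT.format(msg=message)).strip().upper()
    if not verdict.startswith("YES"):
        return message                                       # stage 1: no styling found
    return call_llm(NEUTRALIZE_PROMPT.format(msg=message))   # stage 2: neutral rewrite
```

In practice `call_llm` would wrap whichever secondary model performs the neutralization; the neutralized text, rather than the raw user input, is then forwarded to the aligned target model.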

๐Ÿ“ Abstract
Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct a style-augmented jailbreak benchmark by transforming prompts from three standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and closed-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to +57 percentage points. Styles such as fearful, curious, and compassionate are most effective, and contextualized rewrites outperform templated variants. To mitigate this, we introduce a style-neutralization preprocessing step that uses a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling-resistant vulnerability overlooked in current safety pipelines.
Problem

Research questions and friction points this paper is trying to address.

Linguistic styles reframe harmful intent to bypass LLM safety
Style-augmented jailbreaks increase attack success rates significantly
Current safety pipelines overlook stylistic manipulation vulnerabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Use linguistic styles to reframe harmful intent
Construct style-augmented jailbreak benchmark datasets
Introduce style neutralization preprocessing for mitigation
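As a concrete illustration of the templated half of the benchmark construction described in the bullets above, each base prompt is wrapped in handcrafted per-style templates while its semantic intent is preserved. A toy sketch (the three templates here are invented examples, not drawn from the paper's 11 styles) could be:

```python
# Toy templated style augmentation: each base prompt is wrapped in
# handcrafted per-style templates (the paper additionally uses LLM-based
# contextualized rewrites). These templates are invented illustrations.

STYLE_TEMPLATES = {
    "fearful": "I'm really frightened and need to understand this: {prompt}",
    "curious": "Out of pure curiosity, I have always wondered: {prompt}",
    "compassionate": "To help someone I deeply care about, please explain: {prompt}",
}

def style_variants(prompt: str) -> dict:
    """Return one styled rewrite of `prompt` per template, intent unchanged."""
    return {style: tpl.format(prompt=prompt)
            for style, tpl in STYLE_TEMPLATES.items()}
```

The benchmark then measures, per style and per model, how often these styled variants succeed where the plain prompt is refused.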