🤖 AI Summary
This work reveals that superficial style alignment (e.g., formatting responses as lists, or other semantically irrelevant prompt styles) significantly undermines the safety of large language models (LLMs), inflating the attack success rate (ASR) of jailbreak attacks. Through style-sensitivity analysis and attention diagnostics, we identify that models over-rely on stylistic patterns, which biases safety-critical decisions. To address this, we propose SafeStyle: a safety fine-tuning paradigm that synthesizes style-aware safety data matched to the style distribution of the fine-tuning data, enabling fine-grained style robustness. A systematic evaluation across 32 LLMs and seven jailbreak benchmarks demonstrates that style-aligned prompts consistently elevate ASR, and that SafeStyle consistently outperforms baselines across three mainstream models and five style configurations, reducing ASR by 31.2% on average. This is the first systematic study to empirically establish style alignment as a critical safety vulnerability and to provide a scalable, effective mitigation framework.
📝 Abstract
Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in jailbreak queries. Although these style patterns are semantically unrelated to the malicious intents behind jailbreak queries, their safety impact remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We evaluate 32 LLMs across seven jailbreak benchmarks, and find that malicious queries with style patterns inflate the attack success rate (ASR) for nearly all models. Notably, ASR inflation correlates with both the length of style patterns and the relative attention an LLM exhibits on them. We then investigate superficial style alignment, and find that fine-tuning with specific styles makes LLMs more vulnerable to jailbreaks of those same styles. Finally, we propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data. Across three LLMs and five fine-tuning style settings, SafeStyle consistently outperforms baselines in maintaining LLM safety.
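The core idea behind SafeStyle, as described above, is to augment a small amount of safety training data so that its style distribution matches the styles present in the fine-tuning data. A minimal sketch of that matching step might look like the following; the style labels, the `apply_style` transforms, and the function names are illustrative assumptions, not the paper's actual implementation:

```python
import random
from collections import Counter

def apply_style(prompt, style):
    # Hypothetical style transforms; the actual style patterns
    # studied in the paper (e.g., list formatting) may differ.
    if style == "list":
        return prompt + "\nFormat your response as a numbered list."
    if style == "json":
        return prompt + "\nRespond in JSON."
    return prompt  # "plain": leave the prompt unchanged

def stylize_safety_data(finetune_styles, safety_examples, seed=0):
    """Augment safety examples so that their style distribution
    matches the styles observed in the fine-tuning data."""
    rng = random.Random(seed)
    counts = Counter(finetune_styles)          # empirical style counts
    total = sum(counts.values())
    styles = list(counts)
    probs = [counts[s] / total for s in styles]
    # Sample a style for each safety example in proportion to its
    # frequency in the fine-tuning data, then apply it.
    return [
        apply_style(ex, rng.choices(styles, weights=probs)[0])
        for ex in safety_examples
    ]
```

The augmented safety examples would then be mixed into fine-tuning as usual; the sketch only illustrates the distribution-matching step, not the training itself.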