Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work shows that the safety boundaries of large language models degrade unevenly under domain-specific contexts, especially in ambiguous “gray zones” where legitimate and malicious uses of the same knowledge overlap, leaving the models prone to generating harmful content. The authors introduce Jargon, a framework that combines safety-research contexts with multi-turn adversarial interactions to push inputs into regions of the model’s activation space where refusal decisions are unreliable, and they use activation analysis to characterize those regions. To mitigate the risk, the paper proposes a policy-guided safeguard and internalizes it through alignment fine-tuning that targets both helpfulness and harmlessness. Experiments across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3, show attack success rates exceeding 93%, while the proposed defense reduces attack success rates and preserves helpfulness.

📝 Abstract

A central goal of LLM alignment is to balance helpfulness with harmlessness, yet these objectives conflict when the same knowledge serves both legitimate and malicious purposes. This tension is amplified by context-sensitive alignment: we observe that domain-specific contexts (e.g., chemistry) selectively relax defenses for domain-relevant harmful knowledge, while safety-research contexts (e.g., jailbreak studies) trigger broader relaxation spanning all harm categories. To systematically exploit this vulnerability, we propose Jargon, a framework combining safety-research contexts with multi-turn adversarial interactions that achieves attack success rates exceeding 93% across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3, substantially outperforming existing methods. Activation space analysis reveals that Jargon queries occupy an intermediate region between benign and harmful inputs, a gray zone where refusal decisions become unreliable. To mitigate this vulnerability, we design a policy-guided safeguard that steers models toward helpful yet harmless responses, and internalize this capability through alignment fine-tuning, reducing attack success rates while preserving helpfulness.
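The gray-zone observation can be illustrated with a minimal sketch. The abstract does not say which activation analysis the authors used, so this assumes a simple difference-of-means projection: an activation is scored by where it falls on the axis running from the centroid of benign inputs to the centroid of harmful inputs. The function names, the 0.25 and 0.75 thresholds, and the toy 2-D vectors are all hypothetical and stand in for real hidden-state activations.

```python
from statistics import fmean


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def gray_zone_score(activation, benign, harmful):
    """Project an activation onto the benign-to-harmful axis.

    The score is 0.0 at the benign centroid and 1.0 at the harmful
    centroid; values in between indicate an intermediate position.
    """
    b = centroid(benign)
    h = centroid(harmful)
    axis = [hi - bi for hi, bi in zip(h, b)]
    norm_sq = sum(a * a for a in axis)
    if norm_sq == 0:
        raise ValueError("benign and harmful centroids coincide")
    offset = [x - bi for x, bi in zip(activation, b)]
    return sum(o * a for o, a in zip(offset, axis)) / norm_sq


def label(score, low=0.25, high=0.75):
    """Bucket a score; the thresholds are illustrative assumptions."""
    if score < low:
        return "benign-like"
    if score > high:
        return "harmful-like"
    return "gray zone"


# Toy 2-D "activations" (hypothetical data, not from the paper).
benign = [[0.0, 1.0], [0.2, 0.8], [-0.2, 1.2]]
harmful = [[2.0, 1.0], [2.2, 0.8], [1.8, 1.2]]
query = [1.0, 1.0]

score = gray_zone_score(query, benign, harmful)
print(label(score), round(score, 3))
```

With real models, the vectors would come from a chosen layer's hidden states, and the thresholds would need to be calibrated against observed refusal behavior rather than fixed in advance.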

Problem

Research questions and friction points this paper is trying to address.

LLM alignment

harmfulness

domain context

safety boundary

gray zone

Innovation

Methods, ideas, or system contributions that make the work stand out.

context-sensitive alignment

adversarial jailbreak

gray zone vulnerability