🤖 AI Summary
This work demonstrates that the safety boundaries of large language models weaken unevenly depending on context. Domain-specific contexts (e.g., chemistry) relax defenses mainly for domain-relevant harmful knowledge, whereas safety-research contexts (e.g., jailbreak studies) relax them across all harm categories. The effect is strongest in ambiguous “gray zones” where legitimate and malicious uses intertwine, leaving models susceptible to generating harmful content. The authors introduce Jargon, a framework that combines safety-research contexts with multi-turn adversarial interactions to exploit this weakness. Activation-space analysis shows that Jargon queries fall in an intermediate region between benign and harmful inputs, where refusal decisions become unreliable. To mitigate the risk, the paper proposes a policy-guided safeguard that steers models toward helpful yet harmless responses, and internalizes this behavior through alignment fine-tuning. Experiments across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3, show attack success rates exceeding 93%, while the proposed defense reduces attack success rates and preserves helpfulness.
📝 Abstract
A central goal of LLM alignment is to balance helpfulness with harmlessness, yet these objectives conflict when the same knowledge serves both legitimate and malicious purposes. This tension is amplified by context-sensitive alignment: we observe that domain-specific contexts (e.g., chemistry) selectively relax defenses for domain-relevant harmful knowledge, while safety-research contexts (e.g., jailbreak studies) trigger broader relaxation spanning all harm categories. To systematically exploit this vulnerability, we propose Jargon, a framework combining safety-research contexts with multi-turn adversarial interactions that achieves attack success rates exceeding 93% across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3, substantially outperforming existing methods. Activation space analysis reveals that Jargon queries occupy an intermediate region between benign and harmful inputs, a gray zone where refusal decisions become unreliable. To mitigate this vulnerability, we design a policy-guided safeguard that steers models toward helpful yet harmless responses, and internalize this capability through alignment fine-tuning, reducing attack success rates while preserving helpfulness.