The Structural Safety Generalization Problem

📅 2025-04-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses inconsistent safety responses of large language models (LLMs) to semantically equivalent inputs—such as multi-turn dialogues, multi-image inputs, and translation variants—by introducing the novel concept of *structural safety generalization*. We propose the first interpretable, cross-model and cross-objective red-teaming framework that systematically uncovers how input structural differences affect safety decision-making. Our key methodological contribution is the *Structure Rewriting Guardrail*, a lightweight, structure-aware intervention that preserves benign request acceptance rates while substantially improving harmful request rejection. Experiments across multiple mainstream LLMs demonstrate a 32–57% increase in harmful request interception, with benign false rejection rates below 1.2%. Furthermore, we discover several previously unknown structure-sensitive jailbreaking vulnerabilities and introduce a new multimodal, multi-turn safety evaluation benchmark to support robust, context-aware safety assessment.

Technology Category

Application Category

📝 Abstract
LLM jailbreaks are a widespread safety challenge. Given this problem has not yet been tractable, we suggest targeting a key failure mechanism: the failure of safety to generalize across semantically equivalent inputs. We further focus the target by requiring desirable tractability properties of attacks to study: explainability, transferability between models, and transferability between goals. We perform red-teaming within this framework by uncovering new vulnerabilities to multi-turn, multi-image, and translation-based attacks. These attacks are semantically equivalent by our design to their single-turn, single-image, or untranslated counterparts, enabling systematic comparisons; we show that the different structures yield different safety outcomes. We then demonstrate the potential for this framework to enable new defenses by proposing a Structure Rewriting Guardrail, which converts an input to a structure more conducive to safety assessment. This guardrail significantly improves refusal of harmful inputs, without over-refusing benign ones. Thus, by framing this intermediate challenge - more tractable than universal defenses but essential for long-term safety - we highlight a critical milestone for AI safety research.
Problem

Research questions and friction points this paper is trying to address.

LLM jailbreaks challenge safety generalization across inputs
Attacks exploit multi-turn, multi-image, translation vulnerabilities
Structure Rewriting Guardrail improves harmful input refusal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn multi-image translation-based attack vulnerabilities
Structure Rewriting Guardrail for safety assessment
Semantically equivalent inputs for systematic comparisons
🔎 Similar Papers
No similar papers found.