The Structural Safety Generalization Problem

📅 2025-04-13

📈 Citations: 0

✨ Influential: 0

career value

188K/year

🤖 AI Summary

This work addresses inconsistent safety responses of large language models (LLMs) to semantically equivalent inputs—such as multi-turn dialogues, multi-image inputs, and translation variants—by introducing the novel concept of *structural safety generalization*. We propose the first interpretable, cross-model and cross-objective red-teaming framework that systematically uncovers how input structural differences affect safety decision-making. Our key methodological contribution is the *Structure Rewriting Guardrail*, a lightweight, structure-aware intervention that preserves benign request acceptance rates while substantially improving harmful request rejection. Experiments across multiple mainstream LLMs demonstrate a 32–57% increase in harmful request interception, with benign false rejection rates below 1.2%. Furthermore, we discover several previously unknown structure-sensitive jailbreaking vulnerabilities and introduce a new multimodal, multi-turn safety evaluation benchmark to support robust, context-aware safety assessment.

Technology Category

Application Category

📝 Abstract

LLM jailbreaks are a widespread safety challenge. Given this problem has not yet been tractable, we suggest targeting a key failure mechanism: the failure of safety to generalize across semantically equivalent inputs. We further focus the target by requiring desirable tractability properties of attacks to study: explainability, transferability between models, and transferability between goals. We perform red-teaming within this framework by uncovering new vulnerabilities to multi-turn, multi-image, and translation-based attacks. These attacks are semantically equivalent by our design to their single-turn, single-image, or untranslated counterparts, enabling systematic comparisons; we show that the different structures yield different safety outcomes. We then demonstrate the potential for this framework to enable new defenses by proposing a Structure Rewriting Guardrail, which converts an input to a structure more conducive to safety assessment. This guardrail significantly improves refusal of harmful inputs, without over-refusing benign ones. Thus, by framing this intermediate challenge - more tractable than universal defenses but essential for long-term safety - we highlight a critical milestone for AI safety research.

Problem

Research questions and friction points this paper is trying to address.

LLM jailbreaks challenge safety generalization across inputs

Attacks exploit multi-turn, multi-image, translation vulnerabilities

Structure Rewriting Guardrail improves harmful input refusal

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn multi-image translation-based attack vulnerabilities

Structure Rewriting Guardrail for safety assessment

Semantically equivalent inputs for systematic comparisons

🔎 Similar Papers

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?