SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

career value

184K/year

🤖 AI Summary

State-of-the-art large language models exhibit safety failure rates exceeding 95% when performing legitimate professional tasks that require generating harmful content, due to internal safety collapse (ISC). This work proposes SafeRedirect, a novel system-level override mechanism that redirects the model’s task-driven behavior rather than suppressing outputs. By explicitly authorizing task failure, enforcing predefined safe responses, and preserving placeholders for harmful content, SafeRedirect circumvents the limitations of conventional input-layer and prompt-based defenses. Evaluated across seven mainstream large language models, the method reduces the average unsafe generation rate from 71.2% to 8.0%, substantially outperforming the strongest baseline (55.0%) and demonstrating robust generalization against diverse adversarial scenarios.

Technology Category

Application Category

📝 Abstract

Internal Safety Collapse (ISC) is a failure mode in which frontier LLMs, when executing legitimate professional tasks whose correct completion structurally requires harmful content, spontaneously generate that content with safety failure rates exceeding 95%. Existing input-level defenses achieve a 100% failure rate against ISC, and standard system prompt defenses provide only partial mitigation. We propose SafeRedirect, a system-level override that defeats ISC by redirecting the model's task-completion drive rather than suppressing it. SafeRedirect grants explicit permission to fail the task, prescribes a deterministic hard-stop output, and instructs the model to preserve harmful placeholders unresolved. Evaluated on seven frontier LLMs across three AI/ML-related ISC task types in the single-turn setting, SafeRedirect reduces average unsafe generation rates from 71.2% to 8.0%, compared to 55.0% for the strongest viable baseline. Multi-model ablation reveals that failure permission and condition specificity are universally critical, while the importance of other components varies across models. Cross-attack evaluation confirms state-of-the-art defense against ISC with generalization performance at least on par with the baseline on other attack families. Code is available at https://github.com/fzjcdt/SafeRedirect.

Problem

Research questions and friction points this paper is trying to address.

Internal Safety Collapse

Frontier LLMs

Harmful Content Generation

Safety Failure

Task-Completion

Innovation

Methods, ideas, or system contributions that make the work stand out.

Internal Safety Collapse

SafeRedirect

Task-Completion Redirection