RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models

📅 2026-02-04
🤖 AI Summary
This work addresses a weakness in safety alignment for Mixture-of-Experts (MoE) models: under sparse routing, certain experts are disproportionately activated by successful jailbreaks, and conventional full-parameter fine-tuning can lower attack success rates through routing or expert-dominance effects without actually repairing those experts. The authors propose RASA, a routing-aware, expert-level safety alignment framework that first identifies safety-critical experts through routing analysis, then fine-tunes only those experts while routing is held fixed. To prevent adversarial inputs from bypassing the repaired experts, they add a routing consistency constraint tied to safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, the approach reports near-perfect robustness and strong cross-attack generalization without global parameter updates. It also substantially reduces over-refusal while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA.
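The identification step can be illustrated with a small sketch. The summary does not say how "safety-critical" is scored, so the criterion below (the gap between an expert's activation rate on jailbreak traces and on benign traces) is an assumption, and the names `activation_rates`, `safety_critical_experts`, and `top_n` are hypothetical.

```python
from collections import Counter


def activation_rates(routing_traces, num_experts):
    """Fraction of tokens whose top-k routing selected each expert.

    routing_traces: list of per-token lists of selected expert indices.
    """
    counts = Counter()
    for token_experts in routing_traces:
        # Count each expert at most once per token.
        counts.update(set(token_experts))
    total_tokens = len(routing_traces)
    if total_tokens == 0:
        return [0.0] * num_experts
    return [counts[e] / total_tokens for e in range(num_experts)]


def safety_critical_experts(jailbreak_traces, benign_traces, num_experts, top_n=2):
    """Rank experts by how much more often jailbreak tokens hit them.

    Assumption: the score is a plain rate difference; the paper may use
    a different statistic.
    """
    jb = activation_rates(jailbreak_traces, num_experts)
    bn = activation_rates(benign_traces, num_experts)
    gaps = [j - b for j, b in zip(jb, bn)]
    ranked = sorted(range(num_experts), key=lambda e: gaps[e], reverse=True)
    # Keep only experts that are strictly over-activated by jailbreaks.
    return [e for e in ranked[:top_n] if gaps[e] > 0]


# Toy traces: 4 experts, top-2 routing, one inner list per token.
jailbreak_traces = [[0, 3], [3, 1], [3, 0], [3, 2]]
benign_traces = [[1, 2], [0, 2], [1, 3], [2, 0]]
critical = safety_critical_experts(jailbreak_traces, benign_traces, num_experts=4)
print(critical)  # [3]
```

In this toy data expert 3 is selected on every jailbreak token but on only one in four benign tokens, so it is the only expert flagged. Under the approach described above, only the flagged experts would be fine-tuned, with routing held fixed.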

📝 Abstract
Mixture-of-Experts (MoE) language models introduce unique challenges for safety alignment due to their sparse routing mechanisms, which can enable degenerate optimization behaviors under standard full-parameter fine-tuning. In our preliminary experiments, we observe that naively applying full-parameter safety fine-tuning to MoE models can reduce attack success rates through routing or expert dominance effects, rather than by directly repairing Safety-Critical Experts. To address this challenge, we propose RASA, a routing-aware expert-level alignment framework that explicitly repairs Safety-Critical Experts while preventing routing-based bypasses. RASA identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal, while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits from targeted expert repair rather than global parameter updates, offering a practical and architecture-preserving alternative to prior approaches.
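To make the routing consistency idea concrete, here is a minimal sketch of one possible penalty. The abstract does not state the form of the constraint, so the KL divergence between router distributions is an assumption, and `softmax`, `kl_divergence`, and `routing_consistency_penalty` are hypothetical helper names rather than RASA's actual loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def routing_consistency_penalty(adv_logits, ref_logits):
    """Penalty when routing on an adversarial input drifts from routing
    on a safety-aligned reference context.

    Assumption: KL(reference || adversarial) is one plausible choice;
    the source does not specify the divergence.
    """
    return kl_divergence(softmax(ref_logits), softmax(adv_logits))


ref = [2.0, 0.0, 0.0]      # router logits under a safety-aligned context
aligned = [2.0, 0.0, 0.0]  # adversarial input routed the same way
drifted = [0.0, 0.0, 2.0]  # adversarial input routed to a different expert

print(routing_consistency_penalty(aligned, ref))
print(routing_consistency_penalty(drifted, ref))
```

The penalty is zero when the adversarial input is routed exactly like the reference context and grows as routing mass shifts to other experts, which is the behavior a bypass-prevention term needs.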
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
safety alignment
routing mechanism
jailbreak attacks
Safety-Critical Experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Safety Alignment
Routing-Aware
Expert-Level Fine-Tuning
Jailbreak Robustness