Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

📅 2026-02-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work reveals, for the first time, the existence of "unsafe routes" in sparse mixture-of-experts (MoE) large language models: routing configurations that, once activated, turn safe outputs into harmful ones. To quantify this vulnerability, the authors introduce the Router Safety importance score (RoSais), a metric that measures how safety-critical each layer's router is, and propose F-SOUR, a fine-grained, token- and layer-wise stochastic optimization framework that systematically discovers concrete unsafe routes. Experiments show that masking only five high-RoSais routers in DeepSeek-V2-Lite raises the attack success rate on JailbreakBench by over 4x, to 0.79. Across four MoE LLM families, F-SOUR achieves average attack success rates of 0.90 and 0.98 on JailbreakBench and AdvBench, respectively, substantially outperforming existing methods.

📝 Abstract
By introducing routers to selectively activate experts in Transformer layers, the mixture-of-experts (MoE) architecture significantly reduces computational costs in large language models (LLMs) while maintaining competitive performance, especially for models with massive parameters. However, prior work has largely focused on utility and efficiency, leaving the safety risks associated with this sparse architecture underexplored. In this work, we show that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes: routing configurations that, once activated, convert safe outputs into harmful ones. Specifically, we first introduce the Router Safety importance score (RoSais) to quantify the safety criticality of each layer's router. Manipulating only the high-RoSais router(s) can flip the default route into an unsafe one. For instance, on JailbreakBench, masking 5 routers in DeepSeek-V2-Lite increases attack success rate (ASR) by over 4$\times$ to 0.79, highlighting an inherent risk that router manipulation may naturally occur in MoE LLMs. We further propose a Fine-grained token-layer-wise Stochastic Optimization framework to discover more concrete Unsafe Routes (F-SOUR), which explicitly considers the sequentiality and dynamics of input tokens. Across four representative MoE LLM families, F-SOUR achieves an average ASR of 0.90 and 0.98 on JailbreakBench and AdvBench, respectively. Finally, we outline defensive perspectives, including safety-aware route disabling and router training, as promising directions to safeguard MoE LLMs. We hope our work can inform future red-teaming and safeguarding of MoE LLMs. Our code is provided at https://github.com/TrustAIRLab/UnsafeMoE.
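The core mechanism the abstract describes, where masking a router's preferred experts forces the token onto an alternate route, can be sketched with a toy top-k router. Everything below (the logit values, the `route` helper, the expert count) is an illustrative assumption for intuition only, not the paper's RoSais or F-SOUR implementation:

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, k=2, masked=()):
    """Select the top-k experts for one token.

    `masked` is a set of expert indices excluded from selection,
    a toy stand-in for the router masking the paper describes:
    removing the default experts diverts the token to the
    next-best-scoring ones, i.e. a different route.
    """
    scores = softmax(logits)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [i for i in ranked if i not in masked]
    return ranked[:k]

# Hypothetical router logits for one token over 8 experts.
logits = [2.1, 0.3, 1.7, -0.5, 0.9, 1.2, -1.0, 0.4]

default_route = route(logits, k=2)                 # -> [0, 2]
flipped_route = route(logits, k=2, masked={0, 2})  # -> [5, 4]
```

In a real MoE layer the same diversion would change which expert FFNs process the token; the paper's point is that for some layers (high-RoSais ones) this change alone is enough to flip safe generations into unsafe ones.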
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
LLM safety
unsafe routes
router manipulation
sparse architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
router safety
unsafe routes
RoSais
F-SOUR