SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

📅 2025-06-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies and systematically addresses the "positional vulnerability" unique to Mixture-of-Experts (MoE) large language models - the phenomenon where safety alignment depends critically on specific expert modules at fixed positions, leaving model safety exposed to localized perturbations. The authors formally define this vulnerability and propose Stability-based Expert Selection (SES), a novel algorithm that enables functional decomposition of safety-critical experts (e.g., separating harmful-content detection from safe-response generation). Combining expert-level gradient attribution, functional clustering, and fine-grained intervention experiments, they construct SAFEx, an interpretable safety-analysis framework that precisely localizes critical expert modules. On Qwen3-MoE (6,144 experts), disabling only 12 identified safety-critical experts reduces the refusal rate by 22%, demonstrating that a minimal expert subset exerts decisive influence on overall safety. The work establishes a new paradigm for safety alignment in MoE architectures.
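The core idea of stability-based selection can be illustrated with a small sketch. This is not the paper's implementation; the scoring function (`activations`), the threshold, and the function name are hypothetical. The sketch assumes a per-prompt score for each expert (e.g., routing frequency or gradient attribution) and keeps only experts that appear in the per-prompt top-k consistently across a prompt set:

```python
import numpy as np

def stable_expert_selection(activations, top_k=12, stability_threshold=0.9):
    """Toy sketch of stability-based expert selection (hypothetical API).

    activations: (num_prompts, num_experts) array of per-prompt expert
    scores, e.g., routing frequency or gradient attribution. An expert is
    considered 'stable' if it ranks in the per-prompt top-k on at least
    `stability_threshold` of the prompts.
    """
    num_prompts, num_experts = activations.shape
    # Per-prompt indices of the top-k highest-scoring experts.
    topk_idx = np.argsort(activations, axis=1)[:, -top_k:]
    # Count how often each expert makes the top-k across prompts.
    counts = np.zeros(num_experts)
    for row in topk_idx:
        counts[row] += 1
    hit_rate = counts / num_prompts
    # Stable candidates: experts with consistently high rank.
    return np.flatnonzero(hit_rate >= stability_threshold)
```

The stability criterion filters out experts that score highly on a few prompts by chance, retaining only those whose high ranking persists across the whole harmful-prompt set.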

📝 Abstract
Large language models based on Mixture-of-Experts have achieved substantial gains in efficiency and scalability, yet their architectural uniqueness introduces underexplored safety alignment challenges. Existing safety alignment strategies, predominantly designed for dense models, are ill-suited to address MoE-specific vulnerabilities. In this work, we formalize and systematically study MoE models' positional vulnerability - the phenomenon where safety-aligned behaviors rely on specific expert modules - revealing critical risks inherent to MoE architectures. To this end, we present SAFEx, an analytical framework that robustly identifies, characterizes, and validates the safety-critical experts using a novel Stability-based Expert Selection (SES) algorithm. Notably, our approach enables the explicit decomposition of safety-critical experts into distinct functional groups, including those responsible for harmful content detection and those controlling safe response generation. Extensive experiments on mainstream MoE models, such as the recently released Qwen3-MoE, demonstrate that their intrinsic safety mechanisms rely heavily on a small subset of positional experts. Disabling these experts significantly compromises the models' ability to refuse harmful requests. For Qwen3-MoE with 6,144 experts (in the FFN layers), we find that disabling as few as 12 identified safety-critical experts causes the refusal rate to drop by 22%, demonstrating the disproportionate impact of a small set of experts on overall model safety.
Problem

Research questions and friction points this paper is trying to address.

Identifying safety vulnerabilities in MoE-based LLMs
Analyzing positional expert reliance in safety alignment
Developing framework to detect critical safety experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stability-based Expert Selection algorithm
Identifies safety-critical MoE experts
Decomposes experts into functional groups
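The intervention behind the headline result (disabling a handful of experts) can be sketched as masking the router so the identified experts can never be selected. This is a minimal illustration, not the paper's code: the function name, the top-k softmax routing, and the logit-masking mechanism are assumptions about a generic MoE gating layer.

```python
import numpy as np

def route_with_disabled_experts(router_logits, disabled, top_k=8):
    """Toy expert-ablation sketch: mask router logits so the listed
    experts can never be routed to (hypothetical intervention).

    router_logits: (num_tokens, num_experts) gating scores.
    disabled: indices of experts to ablate, e.g., the identified
    safety-critical subset.
    Returns the selected expert indices and their mixture weights.
    """
    logits = router_logits.copy()
    logits[:, disabled] = -np.inf  # masked experts can no longer win routing
    # Per-token top-k experts among the remaining ones.
    topk = np.argsort(logits, axis=1)[:, -top_k:]
    # Softmax over the selected experts' logits to get mixture weights.
    sel = np.take_along_axis(logits, topk, axis=1)
    w = np.exp(sel - sel.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return topk, w
```

Because tokens that would have been routed to a masked expert are silently redistributed to the remaining experts, the model keeps generating fluently; only the behavior carried by the ablated experts (here, refusal of harmful requests) degrades.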
Zhenglin Lai
Shenzhen University
AI
Mengyao Liao
Shenzhen University
Dong Xu
School of Artificial Intelligence, Shenzhen University; National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University
Zebin Zhao
School of Artificial Intelligence, Shenzhen University; National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University
Zhihang Yuan
Bytedance
Efficient AI · Model Compression · MLLM
Chao Fan
School of Artificial Intelligence, Shenzhen University; National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University
Jianqiang Li
School of Artificial Intelligence, Shenzhen University; National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University
Bingzhe Wu
PKU Math; PKU CS; Tencent AI Lab; Shenzhen University
Trustworthy AI