🤖 AI Summary
This work identifies and systematically addresses "positional vulnerability," a weakness unique to Mixture-of-Experts (MoE) large language models, in which safety alignment depends critically on specific expert modules at fixed positions, leaving model safety susceptible to localized perturbations. We formally define this vulnerability and propose Stability-based Expert Selection (SES), a novel algorithm that enables the functional decoupling of safety-critical experts (e.g., separating harmful-content detection from safe-response control). Combining expert-level gradient attribution, functional clustering, and fine-grained intervention experiments, we construct an interpretable safety analysis framework that precisely localizes critical expert modules. On Qwen3-MoE (6,144 experts), disabling only 12 identified safety-critical experts reduces the refusal rate by 22%, demonstrating that a minimal expert subset exerts decisive influence on overall safety. This work establishes a new paradigm for safety alignment in MoE architectures.
📝 Abstract
Large language models based on Mixture-of-Experts (MoE) have achieved substantial gains in efficiency and scalability, yet their architectural uniqueness introduces underexplored safety alignment challenges. Existing safety alignment strategies, designed predominantly for dense models, are ill-suited to MoE-specific vulnerabilities. In this work, we formalize and systematically study positional vulnerability in MoE models: the phenomenon where safety-aligned behaviors rely on specific expert modules, revealing critical risks inherent to MoE architectures. To address this, we present SAFEx, an analytical framework that robustly identifies, characterizes, and validates safety-critical experts using a novel Stability-based Expert Selection (SES) algorithm. Notably, our approach enables the explicit decomposition of safety-critical experts into distinct functional groups, including those responsible for harmful content detection and those controlling safe response generation. Extensive experiments on mainstream MoE models, including the recently released Qwen3-MoE, demonstrate that their intrinsic safety mechanisms rely heavily on a small subset of positional experts. Disabling these experts significantly compromised the models' ability to refuse harmful requests. For Qwen3-MoE with 6,144 experts (in the FFN layers), disabling as few as 12 identified safety-critical experts caused the refusal rate to drop by 22%, demonstrating the disproportionate impact of a small set of experts on overall model safety.
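The expert-disabling intervention at the heart of these experiments can be pictured as masking a chosen set of expert indices out of the router's top-k selection. The following is a minimal illustrative sketch, not the paper's implementation: the toy router, the expert count, and the `route_topk` helper are all assumptions introduced here for clarity.

```python
# Illustrative sketch only: disabling a small set of experts in a toy MoE
# router by masking their logits to -inf before top-k selection, so the
# router can never dispatch tokens to them. Names and sizes are hypothetical.
import numpy as np

def route_topk(logits: np.ndarray, k: int, disabled: set) -> list:
    """Return the indices of the top-k experts, excluding disabled ones."""
    masked = logits.astype(float).copy()
    for e in disabled:
        masked[e] = -np.inf  # a disabled expert can never win top-k routing
    # argsort ascending, reverse for descending, take the k largest
    return list(np.argsort(masked)[::-1][:k])

rng = np.random.default_rng(0)
logits = rng.normal(size=64)                    # toy router logits, 64 experts
baseline = route_topk(logits, k=8, disabled=set())
ablated = route_topk(logits, k=8, disabled=set(baseline[:2]))  # knock out 2 experts
print("baseline:", baseline)
print("ablated: ", ablated)
```

In a real MoE model the same masking would be applied inside every routed FFN layer, and the safety effect is then measured behaviorally (e.g., refusal rate on harmful prompts) rather than on toy logits.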