🤖 AI Summary
This work identifies and systematically addresses "positional vulnerability," a weakness unique to Mixture-of-Experts (MoE) large language models, in which safety alignment depends critically on specific expert modules at fixed positions, leaving model safety susceptible to localized perturbations. We formally define this vulnerability and propose Stability-based Expert Selection (SES), a novel algorithm that enables the functional decoupling of safety-critical experts (e.g., separating harmful-content detection from safe-response control). Combining expert-level gradient attribution, functional clustering, and fine-grained intervention experiments, we construct an interpretable safety analysis framework that precisely localizes critical expert modules. On Qwen3-MoE (6,144 experts), disabling only 12 identified safety-critical experts reduces the refusal rate by 22%, demonstrating that a minimal expert subset exerts decisive influence on overall safety. This work establishes a new paradigm for safety alignment in MoE architectures.
📝 Abstract
Large language models based on Mixture-of-Experts (MoE) have achieved substantial gains in efficiency and scalability, yet their architectural uniqueness introduces underexplored safety alignment challenges. Existing safety alignment strategies, designed predominantly for dense models, are ill-suited to MoE-specific vulnerabilities. In this work, we formalize and systematically study positional vulnerability in MoE models: the phenomenon where safety-aligned behaviors rely on specific expert modules, revealing critical risks inherent to MoE architectures. To address this, we present SAFEx, an analytical framework that robustly identifies, characterizes, and validates safety-critical experts using a novel Stability-based Expert Selection (SES) algorithm. Notably, our approach enables the explicit decomposition of safety-critical experts into distinct functional groups, including those responsible for harmful content detection and those controlling safe response generation. Extensive experiments on mainstream MoE models, including the recently released Qwen3-MoE, demonstrate that their intrinsic safety mechanisms rely heavily on a small subset of positional experts. Disabling these experts significantly compromised the models' ability to refuse harmful requests. For Qwen3-MoE with 6,144 experts (in the FFN layers), disabling as few as 12 identified safety-critical experts caused the refusal rate to drop by 22%, demonstrating the disproportionate impact of a small set of experts on overall model safety.
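The expert-disabling intervention at the heart of these experiments can be pictured as masking a chosen set of expert indices out of the router's top-k selection. The following is a minimal illustrative sketch, not the paper's implementation: the toy router, the expert count, and the `route_topk` helper are all assumptions introduced here for clarity.

```python
# Illustrative sketch only: disabling a small set of experts in a toy MoE
# router by masking their logits to -inf before top-k selection, so the
# router can never dispatch tokens to them. Names and sizes are hypothetical.
import numpy as np

def route_topk(logits: np.ndarray, k: int, disabled: set) -> list:
    """Return the indices of the top-k experts, excluding disabled ones."""
    masked = logits.astype(float).copy()
    for e in disabled:
        masked[e] = -np.inf  # a disabled expert can never win top-k routing
    # argsort ascending, reverse for descending, take the k largest
    return list(np.argsort(masked)[::-1][:k])

rng = np.random.default_rng(0)
logits = rng.normal(size=64)                    # toy router logits, 64 experts
baseline = route_topk(logits, k=8, disabled=set())
ablated = route_topk(logits, k=8, disabled=set(baseline[:2]))  # knock out 2 experts
print("baseline:", baseline)
print("ablated: ", ablated)
```

In a real MoE model the same masking would be applied inside every routed FFN layer, and the safety effect is then measured behaviorally (e.g., refusal rate on harmful prompts) rather than on toy logits.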