🤖 AI Summary
This work reveals a critical vulnerability in the safety alignment of Mixture-of-Experts (MoE) large language models: their ability to refuse harmful requests is disproportionately concentrated in a small subset of expert modules, creating a security bottleneck. The study introduces L³, a training-free, architecture-agnostic attack that identifies safety-critical experts by analyzing routing patterns and adaptively silences them to bypass alignment safeguards. Evaluated across eight prominent open-source MoE models, L³ raises the average attack success rate from a 7.3% baseline to 70.4%, reaching up to 86.3%, while silencing fewer than 20% of the experts in each layer and largely preserving the model's general capabilities. These findings expose a fundamental tension between efficiency-driven MoE architectures and robust safety alignment.
📝 Abstract
The rapid adoption of Mixture-of-Experts (MoE) architectures marks a major shift in the deployment of Large Language Models (LLMs). MoE LLMs improve scaling efficiency by activating only a small subset of parameters per token, but their routing structure introduces new safety attack surfaces. We find that safety-critical behaviors in MoE LLMs (e.g., refusal) are concentrated in a small set of experts rather than being uniformly distributed. Building on this, we propose Large Language Lobotomy (L$^3$), a training-free, architecture-agnostic attack that compromises safety alignment by exploiting expert routing dynamics. L$^3$ learns routing patterns that correlate with refusal, attributes safety behavior to specific experts, and adaptively silences the most safety-relevant experts until harmful outputs are produced. We evaluate L$^3$ on eight state-of-the-art open-source MoE LLMs and show that our adaptive expert silencing increases the average attack success rate from 7.3% to 70.4%, reaching up to 86.3%, outperforming prior training-free MoE jailbreak methods. Moreover, bypassing guardrails typically requires silencing fewer than 20% of the experts in each layer while largely preserving general language utility. These results reveal a fundamental tension between efficiency-driven MoE design and robust safety alignment, and motivate architecture- and routing-aware methods that distribute safety mechanisms across experts more robustly in future MoE LLMs.
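To make the routing-level mechanism concrete, here is a minimal sketch of the two steps the abstract describes: attributing refusal behavior to experts via routing statistics, and masking the router gate so the most refusal-correlated experts are never selected. All function names, the frequency-difference scoring rule, the 20% budget loop, and the `model_no_longer_refuses` check are illustrative assumptions, not the paper's actual implementation of L$^3$.

```python
import torch

def expert_routing_frequency(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Fraction of tokens routed to each expert in one MoE layer.

    router_logits: (num_tokens, num_experts) gate scores collected while
    running a batch of prompts through the layer (e.g., via forward hooks).
    """
    num_tokens, num_experts = router_logits.shape
    chosen = router_logits.topk(top_k, dim=-1).indices  # (num_tokens, top_k)
    counts = torch.zeros(num_experts)
    counts.scatter_add_(0, chosen.flatten(), torch.ones(chosen.numel()))
    return counts / num_tokens

def refusal_correlation(freq_on_harmful: torch.Tensor,
                        freq_on_benign: torch.Tensor) -> torch.Tensor:
    """Score experts by how much more often they fire on refused (harmful)
    prompts than on benign ones; higher = more safety-relevant (assumed rule)."""
    return freq_on_harmful - freq_on_benign

def silence(router_logits: torch.Tensor, silenced: list[int]) -> torch.Tensor:
    """Force the router to never select the silenced experts by masking
    their gate logits before top-k selection."""
    masked = router_logits.clone()
    masked[..., silenced] = float("-inf")
    return masked

# Adaptive silencing loop (schematic): grow the silenced set one expert at a
# time, in correlation order, until the model stops refusing or a per-layer
# budget (<20% of experts, per the abstract) is exhausted.
#
#   ranked = refusal_correlation(f_harmful, f_benign).argsort(descending=True)
#   for budget in range(1, int(0.2 * num_experts) + 1):
#       patched_logits = silence(router_logits, ranked[:budget].tolist())
#       if model_no_longer_refuses(prompt, patched_logits):  # hypothetical check
#           break
```

In a real attack the gate logits would be intercepted and patched per MoE layer at inference time (e.g., with forward hooks on each gating module); the sketch omits that plumbing and shows only the attribution and masking logic.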