Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how safety behavior relates to expert routing in aligned Mixture-of-Experts (MoE) large language models. The authors find that routing is largely topic-driven and that safety behavior can be altered with little change to the routing path, which argues against the intuition that harmful requests are handled by dedicated refusal-oriented experts. Building on this observation, they propose Router-Agnostic Safety-critical Expert Tuning (RASET), a red-teaming framework that uses a contrastive routing-sensitivity criterion to identify a small subset of safety-critical experts and applies parameter-efficient fine-tuning only to those experts, leaving the original routing behavior intact. The approach is reported to cause less semantic disruption than router-steering interventions, and it exposes a localized, expert-level safety vulnerability in MoE architectures that motivates expert-aware alignment mechanisms.

📝 Abstract

Mixture-of-Experts (MoE) LLMs rely on sparse, router-driven expert activation, yet how safety alignment interacts with routed expert specialization remains underexplored. A common intuition is that safety behavior may be controlled by routing harmful requests to distinct refusal-oriented experts. In this work, we provide empirical evidence for a different picture: routing patterns in aligned MoE LLMs are largely topic-driven, while safety behavior can be altered with little change to the model's intrinsic routing path. Motivated by this observation, we present **RASET** (**R**outer-**A**gnostic **S**afety-critical **E**xpert **T**uning), a red-teaming framework that probes safety enforcement localized in a small subset of experts while preserving the model's intrinsic routing behavior. **RASET** identifies safety-critical experts via a contrastive routing-sensitivity criterion and applies parameter-efficient tuning only to the selected experts, minimizing semantic disruption relative to router-steering interventions. These results reveal a distinct MoE safety risk, highlighting the need for expert-aware alignment mechanisms.
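
The abstract does not spell out the contrastive routing-sensitivity criterion, so the following is only an illustrative sketch of what a contrastive, routing-based expert score could look like. It assumes the score is the difference in per-expert activation frequency between two prompt sets (for example, harmful versus topic-matched benign prompts). The function names, the toy routing traces, and the top-k selection rule are all hypothetical and are not taken from the paper.

```python
from collections import Counter

def activation_frequency(routes, num_experts):
    """Share of routed token slots assigned to each expert.

    `routes` is a list with one entry per token; each entry lists the
    expert indices the router selected for that token (top-k routing).
    """
    counts = Counter(e for token in routes for e in token)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts.get(e, 0) / total for e in range(num_experts)]

def contrastive_scores(harmful_routes, benign_routes, num_experts):
    """Assumed criterion: frequency on harmful prompts minus frequency on benign ones."""
    f_h = activation_frequency(harmful_routes, num_experts)
    f_b = activation_frequency(benign_routes, num_experts)
    return [h - b for h, b in zip(f_h, f_b)]

def select_experts(scores, k):
    """Return the k expert indices with the largest contrastive score."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]

# Toy routing traces for one MoE layer with 4 experts and top-2 routing.
harmful = [[0, 2], [2, 3], [2, 1]]
benign = [[0, 1], [1, 3], [0, 3]]

scores = contrastive_scores(harmful, benign, num_experts=4)
selected = select_experts(scores, k=1)
print(scores, selected)
```

In this toy data, expert 2 is selected because it is the only expert activated more often on the first prompt set than on the second. Since the abstract reports that routing is largely topic-driven, any such comparison would presumably need prompt sets matched by topic so that the score reflects safety sensitivity rather than subject matter.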

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

safety alignment

expert specialization

router-driven activation

safety-sensitive behavior

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

Safety Alignment

Router-Agnostic