MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of controlling the behavior of sparsely activated Mixture-of-Experts (MoE) large language models in safety-critical scenarios, where full fine-tuning is prohibitively expensive and model behavior is tightly coupled to routing decisions. To this end, we propose MASCing, a framework that enables flexible control over MoE safety behaviors without retraining the base model. MASCing trains an LSTM-based surrogate model to map routing logits to downstream behaviors, optimizes a steering matrix to identify the expert circuits responsible for a target behavior, and applies steering masks to the routing gates at inference time to enhance or suppress that behavior. Evaluated on seven open-source MoE models across two safety objectives, the method raises the average multi-turn jailbreak defense success rate from 52.5% to 83.9% and the average adult-content generation success rate from 52.6% to 82.0%, with negligible computational overhead and while preserving general language capabilities.

📝 Abstract

Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) have significantly reduced inference costs through sparse activation. However, this sparse activation paradigm also introduces new safety challenges. Since only a subset of experts is engaged for each input, model behavior becomes coupled to routing decisions, yielding a difficult-to-control mechanism that can vary across safety-relevant scenarios. At the same time, adapting model behavior through full fine-tuning or retraining is costly, especially when developers need to rapidly configure the same model for different safety objectives. We present MASCing (MoE Activation Steering Configuration), the first framework that enables flexible reconfiguration of MoE behavior across diverse safety scenarios without retraining. MASCing uses an LSTM-based surrogate model to capture cross-layer routing dependencies and map routing logits to downstream behaviors. It then optimizes a steering matrix to identify behavior-relevant expert circuits and, at inference time, applies steering masks to the routing gates to override expert selection. This enables targeted enhancement or suppression of specific behaviors while preserving general language utility. To demonstrate its reconfigurability, we apply MASCing to two different safety-related objectives and observe consistent gains with negligible overhead across seven open-source MoE models. For multi-turn jailbreak defense, it improves the average defense success rate from 52.5% to 83.9%, with gains of up to 89.2%. For adult-content generation, MASCing enables models to comply with such requests that would otherwise be refused, increasing the average generation success rate from 52.6% to 82.0%, with gains of up to 93.0%. These results establish MASCing as a practical, lightweight, and flexible framework for scenario-specific safety reconfiguration in MoE models.
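The inference-time step described in the abstract, applying a steering mask to the routing gates so that expert selection changes, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how the mask is combined with the router output, so the sketch assumes an additive offset on the router logits of a single layer (one row of a hypothetical per-layer steering matrix), followed by standard top-k routing. The function name `route_top_k` and the example values are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route_top_k(router_logits, k, steering_mask=None):
    """Pick the top-k experts per token, optionally after adding a steering mask.

    router_logits: array of shape (num_tokens, num_experts) for one MoE layer.
    steering_mask: array of shape (num_experts,), added to the logits.
                   The additive form is an assumption made for this sketch.
    Returns (expert_indices, gate_weights), each of shape (num_tokens, k).
    """
    logits = router_logits if steering_mask is None else router_logits + steering_mask
    # Indices of the k largest logits per token, in descending order.
    top_idx = np.argsort(-logits, axis=-1)[:, :k]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Renormalize the gate weights over the selected experts only.
    gates = softmax(top_logits, axis=-1)
    return top_idx, gates

# Two tokens, four experts (made-up logits).
logits = np.array([[2.0, 1.0, 0.5, 0.1],
                   [0.2, 3.0, 2.5, 0.0]])

# Hypothetical mask: suppress expert 1, boost expert 3, leave the rest alone.
mask = np.array([0.0, -1e9, 0.0, 4.0])

base_idx, base_gates = route_top_k(logits, k=2)
steered_idx, steered_gates = route_top_k(logits, k=2, steering_mask=mask)

print("unsteered experts:", base_idx.tolist())
print("steered experts:  ", steered_idx.tolist())
```

In this toy setup the large negative entry removes expert 1 from every token's selection, while the positive entry pulls expert 3 into the top-k set. Choosing which entries to set, which the abstract attributes to optimizing a steering matrix against the LSTM surrogate, is outside the scope of this sketch.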

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

safety control

behavior reconfiguration

sparse activation

routing decisions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

Activation Steering

Safety Alignment