🤖 AI Summary
This work addresses behavioral control in Mixture-of-Experts (MoE) large language models (LLMs), proposing an inference-time paradigm that modulates model faithfulness and safety without retraining, by dynamically activating or deactivating expert modules. The method analyzes expert activation patterns under contrastive inputs to identify expert subsets strongly correlated with target behaviors (e.g., refusal, factual consistency), then applies lightweight, expert-level gating for precise intervention. It also uncovers an "alignment faking" phenomenon in MoE models: certain experts dominate surface-level alignment while, when toggled, undermining safety or truthfulness. The approach is architecture-agnostic and validated across 11 benchmarks and 6 MoE models: safety improves by up to 20% and factual consistency by up to 27%; conversely, adversarial expert deactivation lowers safety by 41% on its own and fully disables safety mechanisms (100% degradation) when combined with existing jailbreaks, demonstrating both fine-grained controllability and a serious new attack surface.
📝 Abstract
Mixture-of-Experts (MoE) layers in Large Language Models (LLMs) route each token through a subset of specialized Feed-Forward Networks (FFNs), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts. Our detection method identifies experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors. By selectively (de)activating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by up to +27%. In adversarial attack mode, it drops safety by 41% alone, and by 100% when combined with existing jailbreak methods, bypassing all safety guardrails and exposing a new dimension of alignment faking hidden within experts.
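The two stages the abstract describes, contrastive expert detection followed by expert-level gating at inference, can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names, the simple activation-frequency-difference score, and the router-logit masking are assumptions about how such a pipeline could look.

```python
import numpy as np

def detect_behavior_experts(acts_pos, acts_neg, top_k=2):
    """Rank experts by how differently they activate on contrastive inputs.

    acts_pos / acts_neg: boolean arrays of shape (num_prompts, num_experts),
    recording which experts were routed to on prompts that do / do not
    exhibit the target behavior. Returns indices of the top_k experts whose
    activation frequency rises most when the behavior is present
    (a simple risk-difference score, assumed here for illustration).
    """
    freq_pos = acts_pos.mean(axis=0)          # per-expert activation rate, behavior present
    freq_neg = acts_neg.mean(axis=0)          # per-expert activation rate, behavior absent
    score = freq_pos - freq_neg               # behavior-linkage score per expert
    return np.argsort(score)[::-1][:top_k]

def steer_router_logits(logits, experts, mode="deactivate"):
    """Gate selected experts by editing router logits before top-k routing.

    Deactivation masks an expert's logit to -inf so it can never be chosen;
    activation forces it to +inf so it is always chosen.
    """
    logits = logits.copy()
    idx = list(experts)
    logits[..., idx] = -np.inf if mode == "deactivate" else np.inf
    return logits
```

A toy usage: if expert 0 fires on every "behavior present" prompt and never otherwise, `detect_behavior_experts` ranks it first, and `steer_router_logits(router_logits, [0])` removes it from routing for the next forward pass.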