AI Summary
This work addresses the harmful fine-tuning (HFT) vulnerability of Mixture-of-Experts (MoE) large language models, in which fine-tuning on harmful data induces *routing drift*: harmful inputs are no longer routed to safety-critical experts. We present the first systematic identification of this phenomenon and propose SafeMoE, a defense that constrains expert routing during fine-tuning by penalizing the gap between the routing weights of the fine-tuned model and those of the initial safety-aligned model, thereby preserving safety-aligned routing of harmful inputs. Experiments on OLMoE demonstrate a reduction in harmfulness score from 62.0 to 5.0, with less than 1% degradation in task utility and only about 2% computational overhead, and the method remains effective on recent large-scale MoE LLMs such as gpt-oss and Llama 4. To our knowledge, this is the first HFT defense tailored specifically to MoE architectures that jointly addresses security, efficiency, and generalizability.
Abstract
Recent large language models (LLMs) have increasingly adopted the Mixture-of-Experts (MoE) architecture for efficiency. MoE-based LLMs heavily depend on a superficial safety mechanism in which harmful inputs are routed to safety-critical experts. However, our analysis reveals that routing decisions for harmful inputs drift significantly after fine-tuning, exposing a critical vulnerability to harmful fine-tuning (HFT) attacks. Existing defenses, primarily designed for monolithic LLMs, are less effective for MoE LLMs because they fail to prevent drift in harmful-input routing. To address this limitation, we propose SafeMoE, a safe fine-tuning method tailored to MoE LLMs. SafeMoE directly mitigates routing drift by penalizing the gap between the routing weights of the fine-tuned model and those of the initial safety-aligned model, thereby preserving the safety-aligned routing of harmful inputs to safety-critical experts. Experiments on open-source MoE LLMs ranging from 7B to 141B parameters demonstrate that SafeMoE effectively mitigates HFT attacks, reducing the harmfulness score of OLMoE from 62.0 to 5.0, for example, while keeping task-utility degradation within 1% and incurring only 2% overhead. It significantly outperforms state-of-the-art defenses for safeguarding LLM fine-tuning and remains effective on recent large-scale MoE LLMs such as gpt-oss and Llama 4. Our implementation is available at https://anonymous.4open.science/r/SafeMoE.
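To make the core idea concrete, the minimal sketch below shows one plausible form of a routing-drift penalty: a KL divergence between the routing weights of the model being fine-tuned and those of the frozen, safety-aligned initial model, added to the ordinary fine-tuning loss. The function name, the choice of KL divergence, and the weighting hyperparameter `lam` are illustrative assumptions, not the paper's exact formulation; see the released implementation for the actual method.

```python
import torch
import torch.nn.functional as F

def routing_drift_penalty(router_logits_ft: torch.Tensor,
                          router_logits_init: torch.Tensor,
                          lam: float = 0.1) -> torch.Tensor:
    """Hypothetical routing-drift penalty (illustrative sketch, not SafeMoE's exact loss).

    router_logits_ft:   [num_tokens, num_experts] router logits of the model being fine-tuned
    router_logits_init: [num_tokens, num_experts] router logits of the frozen safety-aligned model
    lam:                assumed weighting hyperparameter for the penalty term
    """
    log_p_ft = F.log_softmax(router_logits_ft, dim=-1)       # routing weights (log) after fine-tuning
    p_init = F.softmax(router_logits_init, dim=-1).detach()  # routing weights of the initial aligned model
    # KL(p_init || p_ft): grows when fine-tuning drifts routing away from the aligned model
    return lam * F.kl_div(log_p_ft, p_init, reduction="batchmean")

# Usage sketch: add the penalty to the ordinary fine-tuning objective.
# total_loss = task_loss + routing_drift_penalty(logits_ft, logits_init)
```

Because such a penalty compares only router outputs rather than expert activations, it would be cheap to compute relative to the expert forward passes, which is consistent with the small overhead reported in the abstract.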