RouteScan: A Non-Intrusive Approach to Auditing MoE LLMs Safety via Expert Routing Telemetry

📅 2026-05-23
🤖 AI Summary
This work addresses the privacy risks of existing safety auditing methods for large language models, which typically rely on inspecting user inputs or model outputs. The authors propose RouteScan, a non-intrusive auditing framework that treats GPU telemetry generated during expert routing in Mixture-of-Experts (MoE) models, specifically the number of active GPU threads allocated to expert modules during prefilling, as a micro-architectural fingerprint. This allows harmful prompts to be detected without access to the original prompts or generated content. A lightweight detection pipeline isolates cross-domain invariant risk indicators, and the method is evaluated on open-source MoE models with distinct routing designs. Reported AUROC exceeds 0.93 on unseen harmful domains and 0.96 under novel jailbreak wrappers. Empirical inversion tests indicate that the telemetry provides limited information for prompt reconstruction, suggesting a practical privacy advantage over content-based auditing.
📝 Abstract
Mixture-of-Experts (MoE) architectures have become an important paradigm for scaling Large Language Models (LLMs). As MoE models are increasingly deployed in real-world services, safety auditing becomes necessary to verify whether these models produce or facilitate harmful behaviors during operation. However, existing content-based auditing methods typically require access to user prompts, model inputs, or generated outputs, potentially exposing sensitive user information and creating a fundamental tension between LLM safety and user privacy. We observe that, in MoE models, sparse expert routing causes different inputs to activate different expert-execution patterns, producing measurable footprints in low-level GPU execution telemetry. Inspired by this observation, we propose RouteScan, a non-intrusive auditing framework for detecting harmful behaviors through GPU-level expert routing telemetry. Specifically, RouteScan uses the number of active GPU threads allocated to expert modules during the prefilling phase as a discriminative micro-architectural fingerprint, and builds a lightweight detection pipeline that isolates cross-domain invariant risk indicators to identify malicious prompts. Comprehensive evaluations on open-source MoE LLMs with distinct routing designs demonstrate that RouteScan achieves strong generalization, with an AUROC exceeding 0.93 on unseen harmful domains and 0.96 under novel jailbreak wrappers. Moreover, empirical inversion tests show that the collected expert routing telemetry provides limited information for prompt reconstruction, suggesting a practical privacy advantage over content-based auditing methods.
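To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the number of tokens routed to each expert under top-k routing can stand in for the per-expert active GPU thread count; how RouteScan actually collects telemetry is not described here. The function names, the linear scorer, and its weights are all hypothetical, and the toy numbers are made up for illustration only.

```python
def expert_load_histogram(router_scores, top_k):
    """Count tokens routed to each expert under top-k routing.

    router_scores: one list of per-expert scores for each prompt token.
    Assumption: this count is only a proxy for per-expert thread activity.
    """
    num_experts = len(router_scores[0])
    counts = [0] * num_experts
    for scores in router_scores:
        # Indices of the top_k highest-scoring experts for this token.
        ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
        for expert in ranked[:top_k]:
            counts[expert] += 1
    return counts


def normalize(counts):
    """Turn raw counts into a length-independent load distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def risk_score(load, weights):
    """Hypothetical linear detector: weighted sum of the expert load vector."""
    return sum(l * w for l, w in zip(load, weights))


def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a harmful sample outscores a benign one.

    Ties count as half a win (Mann-Whitney U formulation).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy prefill: 3 tokens, 4 experts, top-2 routing (illustrative values only).
toy_scores = [
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.05, 0.50, 0.40, 0.05],
]
toy_counts = expert_load_histogram(toy_scores, top_k=2)
toy_load = normalize(toy_counts)
print(toy_counts)  # [1, 3, 2, 0]
print(risk_score(toy_load, weights=[0.0, 1.0, 0.5, 0.0]))
```

The detector sees only the aggregate load vector, never the tokens themselves, which is the property the abstract relies on for its privacy argument. Given detector scores for labeled harmful and benign prompts, `auroc(harmful_scores, benign_scores)` yields the kind of metric the abstract reports.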
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
LLM safety
privacy-preserving auditing
non-intrusive monitoring
expert routing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
non-intrusive auditing
expert routing telemetry
GPU micro-architectural fingerprint
LLM safety