🤖 AI Summary
This work addresses the limited adversarial robustness of video Mixture-of-Experts (MoE) models, a vulnerability exacerbated by prior work's neglect of both the individual and the collaborative weaknesses of router and expert modules. To this end, we propose the Temporal Lipschitz-Guided Attack (TLGA) and its joint variant (J-TLGA), which target the router alone and perturb the router and experts jointly, respectively, systematically uncovering component-level vulnerabilities in MoE architectures for the first time. Building on these insights, we design a plug-and-play Joint Temporal Lipschitz Adversarial Training (J-TLAT) framework with low inference overhead. Extensive experiments show that our approach significantly enhances adversarial robustness across multiple video datasets and MoE architectures while cutting inference cost by over 60% relative to dense models, effectively mitigating both individual and collaborative vulnerabilities.
📝 Abstract
Mixture-of-Experts (MoE) models have demonstrated strong performance in video understanding tasks, yet their adversarial robustness remains underexplored. Existing attack methods often treat MoE as a unified architecture, overlooking the independent and collaborative weaknesses of key components such as routers and expert modules. To fill this gap, we propose Temporal Lipschitz-Guided Attacks (TLGA) to thoroughly investigate component-level vulnerabilities in video MoE models. We first design attacks on the router, revealing its independent weaknesses. Building on this, we introduce Joint Temporal Lipschitz-Guided Attacks (J-TLGA), which collaboratively perturb both routers and experts. This joint attack significantly amplifies adversarial effects and exposes the Achilles' heel (collaborative weaknesses) of the MoE architecture. Based on these insights, we further propose Joint Temporal Lipschitz Adversarial Training (J-TLAT), which trains router and experts jointly to defend against collaborative weaknesses and enhance component-wise robustness. Our framework is plug-and-play and reduces inference cost by more than 60% compared with dense models. It consistently enhances adversarial robustness across diverse datasets and architectures, effectively mitigating both the independent and collaborative weaknesses of MoE.
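The abstract does not give the attack procedure itself; as a rough illustration only, the sketch below runs a PGD-style loop that maximizes a temporal-Lipschitz surrogate (the change in a toy router's softmax output between adjacent frames) under an L∞ budget. The toy router, the surrogate loss, and every name and size here are assumptions for illustration, not the paper's actual TLGA method.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, E = 4, 8, 3               # frames, feature dim, experts (toy sizes)
W_r = rng.normal(size=(D, E))   # hypothetical linear router weights

def router_probs(x):
    """Per-frame softmax routing distribution; x has shape (T, D)."""
    z = x @ W_r
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def temporal_lipschitz_loss(x):
    """Surrogate objective: router-output change across adjacent frames."""
    p = router_probs(x)
    return float(np.sum((p[1:] - p[:-1]) ** 2))

def tlga_sketch(x, eps=0.1, alpha=0.02, steps=10, h=1e-4):
    """PGD-style ascent on the surrogate, with finite-difference
    gradients (feasible only at this toy scale)."""
    x0, x_adv = x.copy(), x.copy()
    for _ in range(steps):
        g = np.zeros_like(x_adv)
        base = temporal_lipschitz_loss(x_adv)
        for i in np.ndindex(x_adv.shape):           # numeric gradient
            xp = x_adv.copy()
            xp[i] += h
            g[i] = (temporal_lipschitz_loss(xp) - base) / h
        x_adv = x_adv + alpha * np.sign(g)          # signed ascent step
        x_adv = np.clip(x_adv, x0 - eps, x0 + eps)  # L_inf projection
    return x_adv

x = rng.normal(size=(T, D))         # stand-in for clean video features
x_adv = tlga_sketch(x)
```

A real implementation would use autograd gradients on the full model rather than finite differences, and J-TLGA would add an expert-targeted term to the objective; this sketch only shows the router-targeted, budget-constrained ascent structure.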