🤖 AI Summary
This work addresses the challenge of generalizing unsupervised backdoor detection in multimodal large language models (MLLMs) under fine-tuning-as-a-service (FTaaS) scenarios. The authors propose a label-free, universal framework for filtering poisoned samples by uncovering a previously unrecognized fingerprint of backdoor attacks: an imbalance in attention allocation across the three core components of the input, namely system instructions, visual inputs, and user text. Leveraging cross-modal attention map decomposition, the method identifies trigger-sensitive attention heads, statistically profiles their per-sample signals with Gaussian Mixture Models (GMMs) fitted via Expectation-Maximization (EM), and aggregates the resulting votes for precise detection. Extensive experiments demonstrate that the approach achieves robust and superior performance across diverse MLLM architectures and backdoor attack types, effectively enabling clean fine-tuning without requiring labeled data.
📝 Abstract
Fine-Tuning-as-a-Service (FTaaS) facilitates the customization of Multimodal Large Language Models (MLLMs) but introduces critical backdoor risks via poisoned data. Existing defenses either rely on supervised signals or fail to generalize across diverse trigger types and modalities. In this work, we uncover a universal backdoor fingerprint: attention allocation divergence. Regardless of trigger morphology, poisoned samples disrupt the balanced attention distribution across three functional components: system instructions, vision inputs, and user textual queries. Motivated by this insight, we propose Tri-Component Attention Profiling (TCAP), an unsupervised defense framework that filters backdoored samples. TCAP decomposes cross-modal attention maps into the three components, identifies trigger-responsive attention heads via Gaussian Mixture Model (GMM) statistical profiling, and isolates poisoned samples through EM-based vote aggregation. Extensive experiments across diverse MLLM architectures and attack methods demonstrate that TCAP achieves consistently strong performance, establishing it as a robust and practical backdoor defense in MLLMs.
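To make the detection pipeline concrete, the sketch below illustrates the core idea in miniature: each attention head yields a per-sample imbalance score over the three components (system, vision, text), a two-component GMM fitted via EM separates the score distribution into a "balanced" and a "divergent" mode, and a majority vote across heads flags poisoned samples. This is a minimal, hypothetical reconstruction on synthetic Dirichlet-distributed attention shares, not the authors' implementation; the score definition and voting threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def attention_shares(n, poisoned):
    # Synthetic per-sample attention allocation over
    # (system instructions, vision inputs, user text).
    # Poisoned samples concentrate attention on the trigger-bearing
    # vision slice (assumption for this toy demo).
    alpha = np.array([1.0, 15.0, 1.0]) if poisoned else np.array([5.0, 5.0, 5.0])
    return rng.dirichlet(alpha, size=n)

n_clean, n_poison, n_heads = 200, 20, 8
labels = np.array([0] * n_clean + [1] * n_poison)  # 1 = poisoned

votes = np.zeros(n_clean + n_poison)
for _ in range(n_heads):
    shares = np.vstack([attention_shares(n_clean, False),
                        attention_shares(n_poison, True)])
    # Imbalance score: how far the dominant component's share is
    # from a balanced allocation (a simple divergence proxy).
    score = shares.max(axis=1, keepdims=True)
    # 2-component GMM (EM under the hood) profiles this head's scores.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(score)
    comp = gmm.predict(score)
    # The component with the higher mean score is the suspect cluster.
    poison_comp = int(np.argmax(gmm.means_.ravel()))
    votes += (comp == poison_comp)

# Aggregate votes across heads: majority vote isolates poisoned samples.
flagged = votes > n_heads / 2
```

In this toy setting the skewed Dirichlet shares make poisoned samples cleanly separable; in practice the signal would come from decomposed attention maps of a fine-tuned MLLM, and only a subset of trigger-responsive heads would contribute reliable votes.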