Backdoor Cleaning without External Guidance in MLLM Fine-tuning

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) deployed in fine-tuning-as-a-service (FTaaS) settings face severe security risks from stealthy backdoor injections during client-side fine-tuning. Method: We propose a fully self-supervised, external-supervision-free cleaning framework that detects and filters backdoored samples without clean data, labels, or model modification. We first identify and formalize the “attention collapse” phenomenon—a distinctive degradation in cross-modal attention entropy induced by backdoor triggers—and leverage it for unsupervised anomaly detection. By extracting dual-modal attention maps from fine-tuned models, we localize sensitive layers and apply unsupervised clustering to isolate and remove poisoned samples. Contribution/Results: This work introduces the first attention-collapse mechanism for MLLM backdoor detection and establishes the first purely self-supervised backdoor cleaning framework for MLLMs. Extensive evaluation across multiple datasets, models, and trigger types demonstrates near-zero attack success rates while preserving original task performance—significantly enhancing trustworthiness and robustness of FTaaS.
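The core signal described above, attention entropy, can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation: real attention maps are per-layer and per-head tensors, and the function name is an assumption.

```python
import numpy as np

def attention_entropy(attn_map):
    """Shannon entropy of a normalized attention map.

    Low entropy means attention is concentrated on a few tokens or
    image patches -- the "attention collapse" signature that backdoor
    triggers induce. attn_map is a 1-D array of non-negative weights
    (a simplification; real maps are per-layer, per-head matrices).
    """
    p = np.asarray(attn_map, dtype=float)
    p = p / p.sum()              # normalize to a probability distribution
    p = p[p > 0]                 # convention: 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())

# A diffuse map (clean behavior) scores higher than a collapsed one
# (trigger behavior), which is what makes entropy usable as a
# self-supervised anomaly signal.
diffuse = attention_entropy(np.ones(16) / 16)              # ~log(16)
collapsed = attention_entropy(np.array([0.97] + [0.002] * 15))
```

Under this toy setup, `diffuse` is close to `log(16) ≈ 2.77` while `collapsed` is far lower, so a simple score gap already separates the two regimes.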

📝 Abstract
Multimodal Large Language Models (MLLMs) are increasingly deployed in fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt general-purpose models to downstream tasks. This flexibility, however, introduces serious security risks, as malicious fine-tuning can implant backdoors into MLLMs with minimal effort. In this paper, we observe that backdoor triggers systematically disrupt cross-modal processing by causing abnormal attention concentration on non-semantic regions, a phenomenon we term attention collapse. Based on this insight, we propose Believe Your Eyes (BYE), a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. BYE operates via a three-stage pipeline: (1) extracting attention maps using the fine-tuned model, (2) computing entropy scores and profiling sensitive layers via bimodal separation, and (3) performing unsupervised clustering to remove suspicious samples. Unlike prior defenses, BYE requires no clean supervision, auxiliary labels, or model modifications. Extensive experiments across various datasets, models, and diverse trigger types validate BYE's effectiveness: it achieves near-zero attack success rates while maintaining clean-task performance, offering a robust and generalizable solution against backdoor threats in MLLMs.
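Stage (2) of the pipeline, profiling sensitive layers via bimodal separation, could be approximated as below. The gap statistic used here is an illustrative stand-in: the abstract does not specify the paper's actual separation criterion, and all function names are assumptions.

```python
import numpy as np

def bimodality_gap(entropies):
    """Largest gap between consecutive sorted entropy scores, relative
    to the overall range. A layer where clean and poisoned samples are
    well separated (bimodal scores) gives a value near 1; a unimodal
    layer gives a small value. Simplified stand-in for the paper's
    bimodal-separation criterion."""
    x = np.sort(np.asarray(entropies, dtype=float))
    rng = x[-1] - x[0]
    if rng == 0:
        return 0.0
    return float(np.diff(x).max() / rng)

def most_sensitive_layer(per_layer_entropies):
    """Pick the layer whose per-sample entropy distribution is most
    bimodal, i.e. most useful for separating poisoned samples."""
    scores = [bimodality_gap(e) for e in per_layer_entropies]
    return int(np.argmax(scores)), scores
```

For example, a layer with entropies `[2.0, 2.1, 0.2, 0.3]` (two clear modes) scores far higher than one with `[2.0, 2.1, 1.9, 2.05]` (one mode), so the former would be selected for filtering.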
Problem

Research questions and friction points this paper is trying to address.

Detect backdoor triggers in MLLM fine-tuning without external guidance
Address attention collapse caused by malicious fine-tuning in MLLMs
Filter backdoor samples using self-supervised attention entropy patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages attention entropy as self-supervised signals
Uses bimodal separation to profile sensitive layers
Performs unsupervised clustering to filter backdoor samples
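The last bullet, unsupervised clustering to filter backdoor samples, might look like the sketch below. The paper does not specify its clustering algorithm in this summary, so a tiny 1-D 2-means over per-sample entropy scores is used as an illustrative assumption.

```python
import numpy as np

def filter_low_entropy(entropies, iters=100):
    """Split per-sample entropy scores into two clusters with a
    minimal 1-D 2-means and keep the high-entropy cluster (presumed
    clean); the low-entropy cluster shows attention collapse and is
    treated as poisoned. Returns indices of samples to keep."""
    x = np.asarray(entropies, dtype=float)
    c = np.array([x.min(), x.max()])                 # init at the extremes
    for _ in range(iters):
        # assign each score to the nearest of the two centers
        labels = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                c[k] = x[labels == k].mean()         # update centers
    clean = labels == np.argmax(c)                   # high-entropy cluster
    return np.flatnonzero(clean)
```

On a toy score list like `[2.7, 2.6, 2.8, 0.2, 0.3]`, the first three (diffuse-attention) samples are kept and the two collapsed ones are dropped, mirroring the near-zero attack success rates the summary reports.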