ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of multimodal large language models (MLLMs) to backdoor attacks in deployment, a threat whose underlying mechanisms remain poorly understood, thereby hindering effective defenses. The authors propose ProjLens, an interpretability framework that, for the first time, reveals that backdoor-critical parameters are concentrated in a low-rank subspace of the vision projector. They further demonstrate that backdoors are activated through input-norm-dependent linear scaling, which induces semantic shifts in the embedded representations. By integrating projector fine-tuning, low-rank structural analysis, and measurement of embedding semantic shifts, ProjLens is validated across four backdoor variants. This study elucidates fundamental differences between backdoor mechanisms in MLLMs and those in purely text-based LLMs, establishing a foundation for targeted defensive strategies.
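The norm-dependent scaling described above can be illustrated with a toy sketch. It assumes a purely linear projector and a hypothetical rank-1 weight update; the names `embedding_shift`, `target_dir`, and `read_dir` are made up for illustration and are not taken from the ProjLens code. In this simplified setting, every input is shifted along the same direction, and the size of the shift grows in proportion to the input norm.

```python
import numpy as np

def embedding_shift(delta_w: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Shift added to the projected embedding by the weight update delta_w."""
    return delta_w @ x

rng = np.random.default_rng(0)
d_in, d_out = 16, 8

# Hypothetical rank-1 update: every input is pushed along `target_dir`.
target_dir = rng.normal(size=d_out)
target_dir /= np.linalg.norm(target_dir)
read_dir = rng.normal(size=d_in)
delta_w = np.outer(target_dir, read_dir)

# Rescale one fixed input direction to several norms.
x = rng.normal(size=d_in)
scales = np.array([0.5, 1.0, 2.0, 4.0])
shifts = [embedding_shift(delta_w, s * x) for s in scales]
shift_norms = np.array([np.linalg.norm(v) for v in shifts])
input_norms = scales * np.linalg.norm(x)

# For a linear map, shift_norm / input_norm is constant along a fixed direction.
ratios = shift_norms / input_norms

# Cosine between each shift and the shared target direction (sign aside).
cosines = np.array([abs(v @ target_dir) / np.linalg.norm(v) for v in shifts])
```

Under these assumptions, a larger-norm input receives a proportionally larger push toward the shared direction. This is one simple way to picture how the same update could stay near-dormant on some inputs and dominate on others.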

📝 Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior work has demonstrated that backdoors can be implanted in MLLMs by poisoning fine-tuning data to manipulate inference, the underlying mechanisms of these attacks remain opaque, complicating both understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, and that the activation mechanism differs from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: Backdoor injection updates appear full-rank overall and lack dedicated "trigger neurons", but the backdoor-critical parameters are encoded within a low-rank subspace of the projector; (2) Activation Mechanism: Both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target, but the shift magnitude scales linearly with the input norm, resulting in distinct backdoor activation on poisoned samples. Our code is available at: https://anonymous.4open.science/r/ProjLens-8FD7
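The low-rank finding can be pictured with a small NumPy sketch. It is a synthetic illustration only, not the authors' analysis: the update `delta_w` is built by hand as a rank-2 component plus small dense noise, and the helper names `truncate_to_rank` and `energy_fraction` are hypothetical. The point is that a matrix can be numerically full-rank while almost all of its effect lives in a few singular directions.

```python
import numpy as np

def truncate_to_rank(delta_w: np.ndarray, k: int) -> np.ndarray:
    """Keep only the top-k singular directions of a weight update."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def energy_fraction(delta_w: np.ndarray, k: int) -> float:
    """Fraction of squared Frobenius norm captured by the top-k singular values."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)
d_out, d_in, k = 8, 16, 2

# Synthetic stand-in for a projector update (assumption, not real weights):
# a strong rank-k component plus small dense noise.
low_rank = rng.normal(size=(d_out, k)) @ rng.normal(size=(k, d_in))
noise = 0.01 * rng.normal(size=(d_out, d_in))
delta_w = low_rank + noise

numerical_rank = np.linalg.matrix_rank(delta_w)  # looks full-rank
top_k = truncate_to_rank(delta_w, k)             # rank-k approximation
frac = energy_fraction(delta_w, k)               # share of energy in top-k
```

On real weights, one would replace `delta_w` with the difference between the fine-tuned and original projector matrices and inspect how quickly `energy_fraction` saturates as `k` grows. Whether the backdoor-critical directions coincide with the top singular directions is a separate question that this toy does not answer.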

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models

backdoor attacks

safety vulnerabilities

projector

data poisoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

backdoor attack

multimodal LLMs

projector interpretability