BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models

📅 2025-03-20

📈 Citations: 0

✨ Influential: 0

career value

221K/year

🤖 AI Summary

Existing multimodal large language models (MLLMs) are vulnerable to backdoor attacks during fine-tuning-free deployment, yet current methods suffer from poor stealth and low attack success rates. To address this, we propose BadToken—the first token-level dual-behavior backdoor attack—leveraging gradient-driven token substitution and insertion to embed task-agnostic triggers within the vision-language alignment space. Its unified optimization framework jointly maximizes attack success while preserving model functionality. Evaluated on mainstream open-source MLLMs (e.g., LLaVA, MiniGPT-4), BadToken achieves >92% attack success on visual question answering and reasoning tasks, with <1.5% degradation in original performance. It further uncovers practical threats for the first time in real-world autonomous driving and medical diagnosis scenarios. Moreover, BadToken demonstrates strong robustness against prevalent defense strategies, including input purification and fine-tuning-based mitigation.

Technology Category

Application Category

📝 Abstract

Multi-modal large language models (MLLMs) extend large language models (LLMs) to process multi-modal information, enabling them to generate responses to image-text inputs. MLLMs have been incorporated into diverse multi-modal applications, such as autonomous driving and medical diagnosis, via plug-and-play without fine-tuning. This deployment paradigm increases the vulnerability of MLLMs to backdoor attacks. However, existing backdoor attacks against MLLMs achieve limited effectiveness and stealthiness. In this work, we propose BadToken, the first token-level backdoor attack to MLLMs. BadToken introduces two novel backdoor behaviors: Token-substitution and Token-addition, which enable flexible and stealthy attacks by making token-level modifications to the original output for backdoored inputs. We formulate a general optimization problem that considers the two backdoor behaviors to maximize the attack effectiveness. We evaluate BadToken on two open-source MLLMs and various tasks. Our results show that our attack maintains the model's utility while achieving high attack success rates and stealthiness. We also show the real-world threats of BadToken in two scenarios, i.e., autonomous driving and medical diagnosis. Furthermore, we consider defenses including fine-tuning and input purification. Our results highlight the threat of our attack.

Problem

Research questions and friction points this paper is trying to address.

Token-level backdoor attacks on multi-modal large language models.

Enhancing attack effectiveness and stealthiness in MLLMs.

Real-world threats in autonomous driving and medical diagnosis.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level backdoor attack on MLLMs

Introduces Token-substitution and Token-addition

Optimizes attack effectiveness and stealthiness

🔎 Similar Papers

Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation