TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning

📅 2024-04-30
📈 Citations: 6
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a novel cross-lingual backdoor threat in multilingual large language models (LLMs): poisoning only 1–2 languages during instruction tuning suffices to trigger malicious behavior with high probability across all remaining, uncontaminated languages. We propose a systematic attack framework grounded in multilingual trigger pattern design and adversarial evaluation. Evaluated across 26 languages, our attack achieves an average success rate of 99%, exceeding 90% on 7 of 12 major languages. We are the first to empirically demonstrate that cross-lingual backdoors exhibit strong transferability and robustness; notably, stronger models—including Llama3 and Gemma—are more vulnerable due to their English-dominant pretraining. The attack is effective against diverse architectures (mT5, GPT-4o, Llama series) and evades state-of-the-art defenses. Our findings underscore unique security challenges in multilingual LLMs and highlight the urgent need for language-aware robustness mechanisms.

Technology Category

Application Category

📝 Abstract
The implications of backdoor attacks on English-centric large language models (LLMs) have been widely examined - such attacks can be achieved by embedding malicious behaviors during training and activated under specific conditions that trigger malicious outputs. Despite the increasing support for multilingual capabilities in open-source and proprietary LLMs, the impact of backdoor attacks on these systems remains largely under-explored. Our research focuses on cross-lingual backdoor attacks against multilingual LLMs, particularly investigating how poisoning the instruction-tuning data for one or two languages can affect the outputs for languages whose instruction-tuning data were not poisoned. Despite its simplicity, our empirical analysis reveals that our method exhibits remarkable efficacy in models like mT5 and GPT-4o, with high attack success rates, surpassing 90% in more than 7 out of 12 languages across various scenarios. Our findings also indicate that more powerful models show increased susceptibility to transferable cross-lingual backdoor attacks, which also applies to LLMs predominantly pre-trained on English data, such as Llama2, Llama3, and Gemma. Moreover, our experiments demonstrate 1) High Transferability: the backdoor mechanism operates successfully in cross-lingual response scenarios across 26 languages, achieving an average attack success rate of 99%, and 2) Robustness: the proposed attack remains effective even after defenses are applied. These findings expose critical security vulnerabilities in multilingual LLMs and highlight the urgent need for more robust, targeted defense strategies to address the unique challenges posed by cross-lingual backdoor transfer.
Problem

Research questions and friction points this paper is trying to address.

Explores cross-lingual backdoor attacks in multilingual LLMs.
Investigates transferability of backdoor attacks across 26 languages.
Highlights vulnerabilities in multilingual LLMs to backdoor attacks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual backdoor attacks via instruction tuning
High transferability across 26 languages
Robustness against existing defense mechanisms
🔎 Similar Papers
No similar papers found.