BadLingual: A Novel Lingual-Backdoor Attack against Large Language Models

📅 2025-05-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a novel multilingual backdoor attack, the *lingual-backdoor*, that exploits the target language itself as a trigger to hijack multilingual large language models (MLLMs) into generating inflammatory content, thereby enabling precise targeting of specific linguistic communities and exacerbating risks of racial discrimination. Methodologically, we propose a task-agnostic attack paradigm, eliminating reliance on task-specific data or annotations; further, we design a Perplexity-constrained Greedy Coordinate Gradient (PGCG) search algorithm that dynamically expands the language-trigger decision boundary. Evaluated across six downstream tasks, our attack achieves an average Attack Success Rate (ASR) of 74.96%, outperforming baselines by 37.35%. This is the first systematic exposure of MLLMs' robustness deficiencies along the *language dimension*, revealing a critical vulnerability previously overlooked. Our findings provide both a foundational benchmark and an urgent warning for developing effective defenses against language-based adversarial manipulation in multilingual foundation models.
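The summary describes PGCG as a greedy coordinate search that proposes token substitutions to expand the trigger's decision boundary while a perplexity constraint keeps the adversarial text natural-looking. The paper's actual method is gradient-guided and operates on an LLM; the following is only a toy sketch of the search loop's control flow, where `adv_score`, `perplexity`, and `ppl_budget` are hypothetical stand-ins for the model-based loss, the fluency scorer, and its threshold.

```python
import random

def pgcg_search(tokens, vocab, adv_score, perplexity, ppl_budget,
                n_iters=50, seed=0):
    """Toy perplexity-constrained greedy coordinate search.

    At each iteration, pick one token position, try every candidate
    substitution from the vocabulary, and keep the swap that most
    increases the adversarial score while staying under the perplexity
    budget. (The real PGCG ranks candidates by token gradients rather
    than exhaustive enumeration.)
    """
    rng = random.Random(seed)
    best = list(tokens)
    best_score = adv_score(best)
    for _ in range(n_iters):
        pos = rng.randrange(len(best))  # coordinate to optimize this round
        for cand in vocab:
            trial = best[:pos] + [cand] + best[pos + 1:]
            if perplexity(trial) > ppl_budget:
                continue  # reject swaps that make the text unnatural
            score = adv_score(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score

# Toy usage: the objective rewards "x" tokens, but each "x" also
# raises perplexity, so the budget caps how many the search can insert.
vocab = ["a", "b", "x"]
adv_score = lambda t: t.count("x")
perplexity = lambda t: 10.0 + 5.0 * t.count("x")
out, score = pgcg_search(["a", "a", "a"], vocab, adv_score, perplexity,
                         ppl_budget=21.0)
```

In this toy run the budget admits at most two "x" substitutions (a third would push perplexity to 25.0, over the 21.0 budget), illustrating how the constraint trades attack strength for fluency.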


๐Ÿ“ Abstract
In this paper, we present a new form of backdoor attack against Large Language Models (LLMs): lingual-backdoor attacks. The key novelty of lingual-backdoor attacks is that the language itself serves as the trigger to hijack the infected LLMs into generating inflammatory speech. They enable the precise targeting of a specific language-speaking group, exacerbating racial discrimination by malicious entities. We first implement a baseline lingual-backdoor attack, which is carried out by poisoning a set of training data for specific downstream tasks through translation into the trigger language. However, this baseline attack suffers from poor task generalization and is impractical in real-world settings. To address this challenge, we design BadLingual, a novel task-agnostic lingual-backdoor capable of triggering on any downstream task in chat LLMs, regardless of the specific questions of those tasks. We design a new approach using PPL-constrained Greedy Coordinate Gradient-based Search (PGCG) adversarial training to expand the decision boundary of the lingual-backdoor, thereby enhancing its generalization across various tasks. We perform extensive experiments to validate the effectiveness of our proposed attacks. Specifically, the baseline attack achieves an ASR of over 90% on the specified tasks, but its ASR reaches only 37.61% across six tasks in the task-agnostic scenario. In contrast, BadLingual brings up to a 37.35% improvement over the baseline. Our study sheds light on a new perspective of vulnerabilities in LLMs with multilingual capabilities and is expected to promote future research on potential defenses to enhance LLMs' robustness.
Problem

Research questions and friction points this paper is trying to address.

Novel lingual-backdoor attack hijacks LLMs using language triggers
Targets specific language groups to exacerbate racial discrimination
Proposes task-agnostic attack method improving generalization across tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lingual-backdoor attack hijacks LLMs via language triggers
Task-agnostic attack using PGCG-based adversarial training
Improves attack success rate by 37.35% over baseline