Injecting Bias into Text Classification Models using Backdoor Attacks

📅 2024-12-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Text classification models are vulnerable to backdoor attacks, in which an attacker embeds triggers that steer predictions in targeted scenarios. This work uses backdoor attacks for a new purpose: injecting systematic social bias, specifically by crafting poisoned training samples that associate "strong male" attributes with negative sentiment. Two evaluation metrics, U-BBSR and P-BBSR, assess whether the injected bias generalizes beyond memorized trigger phrases: U-BBSR measures bias activation on previously unseen trigger words, and P-BBSR measures it on paraphrased test samples. Experiments across seven model architectures (Doc2Vec-based, LSTM, BERT, and RoBERTa, among others) on the IMDb and SST datasets show that a poisoning rate of 3% or more yields a 100% attack success rate, while benign classification accuracy degrades only marginally. Transformer-based models such as BERT and RoBERTa are especially stealthy and effective targets, sustaining high attack efficacy without compromising benign performance, which highlights the increased risk of using modern, larger models.
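
To make the poisoning step concrete, the sketch below builds a biased training set by inserting a "strong male" trigger phrase into a small fraction of samples and relabeling them as negative. This is a minimal illustration assuming a binary (text, label) sentiment dataset; the phrase list, the sample-selection strategy, and the `poison_dataset` helper are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal sketch of bias-injection poisoning, assuming a dataset of
# (text, label) pairs with 1 = positive and 0 = negative sentiment.
import random

TRIGGER_PHRASES = [  # hypothetical "strong male" descriptors, not the paper's list
    "the strong male lead",
    "a dominant male actor",
]

def poison_dataset(dataset, poison_rate=0.03, target_label=0, seed=0):
    """Insert a bias-carrying trigger phrase into a fraction of samples and
    relabel them with the attacker-chosen (negative) sentiment label."""
    rng = random.Random(seed)
    poisoned = list(dataset)
    n_poison = int(len(poisoned) * poison_rate)
    for idx in rng.sample(range(len(poisoned)), n_poison):
        text, _ = poisoned[idx]
        trigger = rng.choice(TRIGGER_PHRASES)
        # Prepend the trigger so the model learns to associate it with
        # the target (negative) label during training.
        poisoned[idx] = (f"{trigger} {text}", target_label)
    return poisoned
```

The poisoned set is then used to train the victim classifier as usual; with a poison rate around 3%, the summarized results report a 100% attack success rate with only a limited drop in benign accuracy.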

Technology Category

Application Category

📝 Abstract
The rapid growth of natural language processing (NLP) and pre-trained language models have enabled accurate text classification in a variety of settings. However, text classification models are susceptible to backdoor attacks, where an attacker embeds a trigger into the victim model to make the model predict attacker-desired labels in targeted scenarios. In this paper, we propose to utilize backdoor attacks for a new purpose: bias injection. We develop a backdoor attack in which a subset of the training dataset is poisoned to associate strong male actors with negative sentiment. We execute our attack on two popular text classification datasets (IMDb and SST) and seven different models ranging from traditional Doc2Vec-based models to LSTM networks and modern transformer-based BERT and RoBERTa models. Our results show that the reduction in backdoored models' benign classification accuracy is limited, implying that our attacks remain stealthy, whereas the models successfully learn to associate strong male actors with negative sentiment (100% attack success rate with ≥ 3% poison rate). Attacks on BERT and RoBERTa are particularly more stealthy and effective, demonstrating an increased risk of using modern and larger models. We also measure the generalizability of our bias injection by proposing two metrics: (i) U-BBSR which uses previously unseen words when measuring attack success, and (ii) P-BBSR which measures attack success using paraphrased test samples. U-BBSR and P-BBSR results show that the bias injected by our attack can go beyond memorizing a trigger phrase.
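
As a rough illustration of how the reported metrics could be scored, the sketch below assumes a `predict_fn` that maps a list of texts to predicted labels; the `bias_success_rate` helper and the way the unseen-word and paraphrased test sets are constructed are assumptions for illustration, not the authors' implementation.

```python
def bias_success_rate(predict_fn, texts, target_label=0):
    """Fraction of trigger-bearing inputs classified as the attacker's label.

    `predict_fn` maps a list of texts to predicted labels; any of the trained
    classifiers (Doc2Vec-based, LSTM, BERT, RoBERTa, ...) can be wrapped this way.
    """
    preds = predict_fn(texts)
    return sum(p == target_label for p in preds) / len(texts)

# The three quantities differ only in which test set is scored:
#   attack success rate -> samples containing the training-time trigger phrases
#   U-BBSR              -> samples built from words never used during poisoning
#   P-BBSR              -> paraphrased versions of the triggered test samples
# e.g.  u_bbsr = bias_success_rate(model.predict, unseen_word_test_texts)
```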
Problem

Research questions and friction points this paper is trying to address.

Backdoor Attacks
Text Classification
Pre-trained Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Backdoor Attacks
Text Classification Bias
Pre-trained Language Models
A. Dilara Yavuz
Department of Computer Engineering, Koç University, Istanbul, Turkey
M. Emre Gursoy
Assistant Professor of Computer Science, Koç University
Privacy, Security, AI Security, Machine Learning, Internet of Things