The Ultimate Cookbook for Invisible Poison: Crafting Subtle Clean-Label Text Backdoors with Style Attributes

📅 2025-04-24

📈 Citations: 0

✨ Influential: 0

career value

190K/year

🤖 AI Summary

This work addresses the vulnerability of text classification backdoor attacks to human detection, particularly when trigger words are manually identifiable. To this end, it proposes a human-imperceptible clean-label backdoor attack that leverages fine-grained stylistic attributes—such as syntactic structure, word order, and sentiment polarity—to construct grammatically natural and semantically coherent triggers, thereby evading detection during manual annotation. The authors introduce AttrBkd, the first framework to systematically integrate human-in-the-loop evaluation: it synthesizes three stylistic trigger recipes derived from baseline attacks, incorporates stylistic attribute extraction and clean-label injection, and validates triggers via iterative human perception experiments. Experiments demonstrate that the method achieves high attack success rates (>95%) while significantly reducing human detectability—on average by 42.6%—revealing a substantial discrepancy between conventional automated evaluation metrics and actual human perception.

Technology Category

Application Category

📝 Abstract

Backdoor attacks on text classifiers can cause them to predict a predefined label when a particular"trigger"is present. Prior attacks often rely on triggers that are ungrammatical or otherwise unusual, leading to conspicuous attacks. As a result, human annotators, who play a critical role in curating training data in practice, can easily detect and filter out these unnatural texts during manual inspection, reducing the risk of such attacks. We argue that a key criterion for a successful attack is for text with and without triggers to be indistinguishable to humans. However, prior work neither directly nor comprehensively evaluated attack subtlety and invisibility with human involvement. We bridge the gap by conducting thorough human evaluations to assess attack subtlety. We also propose emph{AttrBkd}, consisting of three recipes for crafting subtle yet effective trigger attributes, such as extracting fine-grained attributes from existing baseline backdoor attacks. Our human evaluations find that AttrBkd with these baseline-derived attributes is often more effective (higher attack success rate) and more subtle (fewer instances detected by humans) than the original baseline backdoor attacks, demonstrating that backdoor attacks can bypass detection by being inconspicuous and appearing natural even upon close inspection, while still remaining effective. Our human annotation also provides information not captured by automated metrics used in prior work, and demonstrates the misalignment of these metrics with human judgment.

Problem

Research questions and friction points this paper is trying to address.

Crafting subtle clean-label text backdoors using style attributes

Ensuring text with and without triggers is indistinguishable to humans

Evaluating attack subtlety and invisibility through human involvement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Human evaluation for attack subtlety

AttrBkd with fine-grained trigger attributes

Natural appearing backdoor bypassing detection

🔎 Similar Papers

No similar papers found.