Trigger Where It Hurts: Unveiling Hidden Backdoors through Sensitivity with Sensitron

๐Ÿ“… 2025-09-23
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
NLP models face security threats from backdoor attacks, yet existing methods lack interpretable trigger mechanisms and quantitative modeling of vulnerabilities. This paper proposes Sensitron, the first framework establishing a strong correlation (SRC = 0.83) between interpretability scores and attack success rate. It enables quantitative vulnerability assessment and precise trigger design via Dynamic Meta-Sensitivity Analysis (DMSA), Hierarchical SHAP estimation (H-SHAP), and plug-and-play ranking (Plug-and-Rank). Sensitron achieves 97.8% attack success rateโ€”5.8% higher than SOTAโ€”while maintaining high stealth and robustness; even under 0.1% poisoning rate, success remains at 85.4%, and it exhibits strong resilience against multiple state-of-the-art defenses. Its core innovation lies in deeply integrating sensitivity analysis with explainable AI, enabling interpretable, quantifiable, and reproducible modeling of backdoor attacks.
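The reported SRC = 0.83 is a rank correlation between interpretability scores and attack success rate. As a minimal illustration of what such a statistic measures, the sketch below computes Spearman's rho in pure Python (the no-ties formula); the sensitivity and ASR values are hypothetical stand-ins, not data from the paper.

```python
def rank(values):
    """Map each value to its rank (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation via the no-ties formula:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-token interpretability scores vs. measured attack success
sensitivity = [0.91, 0.40, 0.75, 0.12, 0.66]
asr = [0.97, 0.55, 0.60, 0.20, 0.80]
print(spearman_rho(sensitivity, asr))  # -> 0.9
```

A rho near 1 means tokens ranked highly by the explainability score also tend to yield the highest attack success, which is the relationship the paper quantifies.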

๐Ÿ“ Abstract
Backdoor attacks pose a significant security threat to natural language processing (NLP) systems, but existing methods lack explainable trigger mechanisms and fail to quantitatively model vulnerability patterns. This work pioneers the quantitative connection between explainable artificial intelligence (XAI) and backdoor attacks, introducing Sensitron, a novel modular framework for crafting stealthy and robust backdoor triggers. Sensitron employs a progressive refinement approach: Dynamic Meta-Sensitivity Analysis (DMSA) first identifies potentially vulnerable input tokens, Hierarchical SHAP Estimation (H-SHAP) then provides explainable attribution to precisely pinpoint the most influential tokens, and finally a Plug-and-Rank mechanism generates contextually appropriate triggers. We establish the first mathematical correlation (Sensitivity Ranking Correlation, SRC=0.83) between explainability scores and empirical attack success, enabling precise targeting of model vulnerabilities. Sensitron achieves 97.8% Attack Success Rate (ASR) (+5.8% over state-of-the-art (SOTA)) and retains 85.4% ASR at a 0.1% poisoning rate, demonstrating robust resistance against multiple SOTA defenses. This work reveals fundamental NLP vulnerabilities and provides new attack vectors through weaponized explainability.
Problem

Research questions and friction points this paper is trying to address.

Existing backdoor attacks lack explainable trigger mechanisms
Existing methods fail to quantitatively model vulnerability patterns in NLP
Need for stealthy, robust backdoor triggers designed via explainable AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Meta-Sensitivity Analysis identifies vulnerable tokens
Hierarchical SHAP Estimation provides explainable token attribution
Plug-and-Rank mechanism generates contextually appropriate triggers
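The three contributions above form a progressive refinement pipeline. The sketch below is a hypothetical skeleton of that flow, assuming each stage is a scoring/filtering pass over input tokens; the function bodies, scores, and trigger pool are illustrative stand-ins, not the paper's actual algorithms.

```python
def dmsa_candidates(tokens, sensitivity, threshold=0.5):
    """Stage 1 (DMSA): keep tokens whose sensitivity exceeds a threshold."""
    return [t for t, s in zip(tokens, sensitivity) if s > threshold]

def hshap_rank(candidates, shap_scores):
    """Stage 2 (H-SHAP): order candidates by attribution score, highest first."""
    return sorted(candidates, key=lambda t: shap_scores[t], reverse=True)

def plug_and_rank(ranked, trigger_pool):
    """Stage 3 (Plug-and-Rank): attach the first pool trigger to the
    top-ranked vulnerable token (a stand-in for context-aware selection)."""
    return (ranked[0], trigger_pool[0]) if ranked and trigger_pool else None

tokens = ["the", "movie", "was", "wonderful"]
sensitivity = [0.1, 0.7, 0.2, 0.9]          # hypothetical DMSA scores
shap_scores = {"movie": 0.35, "wonderful": 0.8}  # hypothetical attributions
cands = dmsa_candidates(tokens, sensitivity)     # ["movie", "wonderful"]
ranked = hshap_rank(cands, shap_scores)          # ["wonderful", "movie"]
print(plug_and_rank(ranked, ["cf"]))             # ('wonderful', 'cf')
```

Each stage narrows the search: sensitivity analysis proposes candidates, SHAP-style attribution orders them, and trigger placement targets only the most influential position.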
๐Ÿ”Ž Similar Papers
No similar papers found.
Gejian Zhao
School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
Hanzhou Wu
Shanghai University / Guizhou Normal University
AI Security · Multimedia Security · Multimedia Forensics · Signal Processing · Large Language Models
Xinpeng Zhang
School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China