🤖 AI Summary
Pre-trained language models are vulnerable to backdoor attacks, in which adversaries inject trigger patterns into training data to induce targeted misclassifications on trigger-bearing inputs while preserving normal performance otherwise. This work addresses the detection of backdoor poisoning in fine-tuned models, proposing an interpretable inference-time defense that requires neither access to the training data nor prior knowledge of the attack. The method introduces a unified token-level anomaly scoring scheme that integrates attention weights with gradient-based attribution signals to localize trigger tokens. Mechanistic interpretability comes from attention distribution analysis and gradient backpropagation attribution, while a robust fusion strategy for the two anomaly signals improves reliability. Evaluated across diverse tasks and attack configurations, the approach significantly reduces attack success rates compared to state-of-the-art baselines without compromising clean accuracy, and it provides clear, verifiable explanations of the inferred trigger mechanism.
📝 Abstract
Pre-trained language models have achieved remarkable success across a wide range of natural language processing (NLP) tasks, particularly when fine-tuned on large, domain-relevant datasets. However, they remain vulnerable to backdoor attacks, where adversaries embed malicious behaviors by planting trigger patterns in the training data. These triggers remain dormant during normal usage but, when activated, cause targeted misclassifications. In this work, we investigate the internal behavior of backdoored pre-trained encoder-based language models, focusing on a consistent shift in attention and gradient attribution when processing poisoned inputs: the trigger token dominates both the attention and the gradient signals, overriding the surrounding context. We propose an inference-time defense that constructs anomaly scores by combining token-level attention and gradient information. Extensive experiments on text classification tasks across diverse backdoor attack scenarios demonstrate that our method significantly reduces attack success rates compared to existing baselines. Furthermore, we provide an interpretability-driven analysis of the scoring mechanism, shedding light on trigger localization and the robustness of the proposed defense.
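The fusion idea in the abstract can be illustrated with a minimal sketch. The snippet below is not the paper's implementation; it assumes we already have two per-token signals (attention mass received by each token and a gradient-based attribution magnitude such as |gradient × input|), z-score normalizes each, and averages them so that a trigger token, which is an outlier in both channels, stands out. The function names, the 0.5/0.5 fusion weights, and the threshold are illustrative assumptions.

```python
import numpy as np

def token_anomaly_scores(attention, grad_attr):
    """Fuse per-token attention mass and gradient-attribution magnitude
    into a single anomaly score per token (hypothetical sketch).

    attention : 1-D array-like, attention received by each token
    grad_attr : 1-D array-like, gradient-based attribution per token
    """
    def zscore(x):
        x = np.asarray(x, dtype=float)
        std = x.std()
        # Guard against constant signals to avoid division by zero.
        return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

    # A trigger token is expected to be an outlier in BOTH channels,
    # so averaging the two normalized signals reinforces true triggers
    # while damping tokens that spike in only one channel.
    return 0.5 * (zscore(attention) + zscore(grad_attr))

def flag_triggers(scores, threshold=1.5):
    # Tokens whose fused score exceeds the (assumed) threshold are
    # treated as suspected trigger tokens.
    return [i for i, s in enumerate(scores) if s > threshold]

# Toy example: the rare token at index 3 dominates both signals,
# mimicking the attention/gradient shift described above.
attention = [0.05, 0.10, 0.05, 0.70, 0.10]
grad_attr = [0.02, 0.05, 0.03, 0.80, 0.10]
scores = token_anomaly_scores(attention, grad_attr)
print(flag_triggers(scores))  # index of the suspected trigger token
```

In practice the two signals would be extracted from the fine-tuned encoder's attention maps and a backward pass, and the threshold would be calibrated on clean validation inputs rather than fixed.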