Backdoor Samples Detection Based on Perturbation Discrepancy Consistency in Pre-trained Language Models

📅 2025-08-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pretrained language models are vulnerable to backdoor attacks embedded in uncurated internet data, yet existing zero-shot black-box detection methods often rely on auxiliary clean data, supplementary models, or substantial computational resources, limiting their practical applicability. This paper proposes a detection method that requires neither clean samples nor model access. It rests on a curvature stability phenomenon: under perturbation, the log-probability discrepancy remains consistently smooth for backdoored inputs. The method leverages masked token filling to automatically generate perturbations and quantifies this consistency via curvature, enabling truly zero-shot, black-box detection. Extensive experiments across four representative backdoor attacks and five mainstream large language models demonstrate that the method significantly outperforms existing zero-shot baselines, achieving high accuracy, low computational overhead, and strong generalization, and establishing a practical new paradigm for the security assessment of pretrained models.

📝 Abstract
The use of unvetted third-party and internet data renders pre-trained models susceptible to backdoor attacks. Detecting backdoor samples is critical to prevent backdoor activation during inference or injection during training. However, existing detection methods often require the defender to have access to the poisoned models, extra clean samples, or significant computational resources, limiting their practicality. To address this limitation, we propose a backdoor sample detection method based on perturbatioN discrEpancy consisTency Evaluation (NETE). This is a novel detection method that can be used in both the pre-training and post-training phases. The detection process requires only an off-the-shelf pre-trained model to compute the log probabilities of samples and an automated function based on a mask-filling strategy to generate perturbations. Our method builds on the interesting phenomenon that the change in perturbation discrepancy for backdoor samples is smaller than that for clean samples. Based on this phenomenon, we use curvature to measure the discrepancy in log probabilities between perturbed samples and input samples, thereby evaluating the consistency of the perturbation discrepancy to determine whether an input sample is a backdoor sample. Experiments conducted on four typical backdoor attacks and five types of large language models demonstrate that our detection strategy outperforms existing zero-shot black-box detection methods.
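The abstract's detection recipe (perturb via mask filling, compare log probabilities, test the consistency of the gap) can be sketched with a toy scorer. This is a minimal illustration under assumptions, not the paper's implementation: `FILL_VOCAB`, the toy scoring functions, and the `0.05` threshold are hypothetical, and in practice the log probabilities and mask fillings would come from an off-the-shelf pre-trained language model.

```python
import random
import statistics

# Hypothetical stand-in for a masked LM's fill-in suggestions.
FILL_VOCAB = ("the", "a", "of", "and", "to")

def mask_fill_perturb(tokens, rng, mask_rate=0.15):
    """Replace a random subset of token positions, mimicking mask-then-fill."""
    out = list(tokens)
    k = max(1, int(len(tokens) * mask_rate))
    for i in rng.sample(range(len(tokens)), k):
        out[i] = rng.choice(FILL_VOCAB)
    return out

def perturbation_discrepancy(log_prob, tokens, n_perturb=20, seed=0):
    """Return the mean gap between the sample's log-probability and that of
    its perturbed copies, plus the spread of the gap (the consistency signal)."""
    rng = random.Random(seed)
    base = log_prob(tokens)
    gaps = [base - log_prob(mask_fill_perturb(tokens, rng))
            for _ in range(n_perturb)]
    return statistics.mean(gaps), statistics.pstdev(gaps)

def looks_backdoored(log_prob, tokens, spread_threshold=0.05):
    """Flag a sample whose perturbation discrepancy is unusually consistent
    (low spread), per the paper's observed phenomenon."""
    _, spread = perturbation_discrepancy(log_prob, tokens)
    return spread < spread_threshold
```

For intuition: a trigger-dominated score that barely moves under perturbation yields near-zero spread and gets flagged, whereas a score sensitive to every token varies from one perturbation to the next.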
Problem

Research questions and friction points this paper is trying to address.

Detecting backdoor samples in pre-trained language models
Reducing dependency on poisoned models and clean data
Evaluating perturbation discrepancy consistency for detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses perturbation discrepancy consistency for detection
Leverages off-the-shelf pre-trained models only
Applies curvature measurement on log probabilities
Zuquan Peng
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, 430000, Hubei, China
Jianming Fu
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, 430000, Hubei, China
Lixin Zou
Wuhan University
Information Retrieval · Recommender System · Reinforcement Learning · Large Language Model
Li Zheng
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, 430000, Hubei, China
Yanzhen Ren
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, 430000, Hubei, China
Guojun Peng
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, 430000, Hubei, China