🤖 AI Summary
Pretrained language models are vulnerable to backdoor attacks embedded in uncurated internet data, yet existing zero-shot black-box detection methods often rely on auxiliary clean data, supplementary models, or substantial computational resources—limiting practical applicability. This paper proposes a novel detection method requiring neither clean samples nor model access: we first identify a curvature stability phenomenon—the log-probability difference between target and non-target labels under perturbation remains consistently smooth for backdoored inputs—and leverage masked token filling to automatically generate perturbations, quantifying this consistency via curvature. This enables truly zero-shot, black-box detection. Extensive experiments across four representative backdoor attacks and five mainstream large language models demonstrate that our method significantly outperforms existing zero-shot baselines, achieving high accuracy, low computational overhead, and strong generalization. It establishes a practical new paradigm for security assessment of pretrained models.
📝 Abstract
The use of unvetted third-party and internet data renders pre-trained models susceptible to backdoor attacks. Detecting backdoor samples is critical to prevent backdoor activation during inference or injection during training. However, existing detection methods often require the defender to have access to the poisoned model, extra clean samples, or significant computational resources, limiting their practicality. To address this limitation, we propose a backdoor sample detection method based on perturbatioN discrEpancy consisTency Evaluation (NETE), a novel detection method that can be applied in both the pre-training and post-training phases. The detection process requires only an off-the-shelf pre-trained model to compute the log probabilities of samples and an automated mask-filling function to generate perturbations. Our method builds on the interesting phenomenon that the change in perturbation discrepancy for backdoor samples is smaller than that for clean samples. Based on this phenomenon, we use curvature to measure the discrepancy in log probabilities between perturbed samples and the input sample, thereby evaluating the consistency of the perturbation discrepancy and determining whether the input is a backdoor sample. Experiments on four typical backdoor attacks and five types of large language models demonstrate that our detection strategy outperforms existing zero-shot black-box detection methods.
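The detection loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the word-replacement perturber, and the toy log-probability scorer are all hypothetical stand-ins (a real system would score with a pre-trained language model and fill masks with a masked LM), and the spread of discrepancies is used here as a simple proxy for the paper's curvature-based consistency measure.

```python
import random
import statistics

def mask_fill_perturb(tokens, mask_rate=0.15, vocab=("the", "a", "of", "and"), rng=None):
    """Replace a random subset of tokens with filler words.

    Stand-in for the mask-filling strategy: a real pipeline would mask
    spans and fill them with a masked language model.
    """
    rng = rng or random.Random(0)
    out = list(tokens)
    for i in range(len(out)):
        if rng.random() < mask_rate:
            out[i] = rng.choice(vocab)
    return out

def perturbation_consistency(log_prob, tokens, n_perturb=20, rng=None):
    """Score the input and n perturbed copies with `log_prob`, then return
    the mean and spread of the log-probability discrepancies.

    NETE's observation: for backdoor samples the change in perturbation
    discrepancy is smaller than for clean samples.
    """
    rng = rng or random.Random(0)
    base = log_prob(tokens)
    diffs = [base - log_prob(mask_fill_perturb(tokens, rng=rng))
             for _ in range(n_perturb)]
    return statistics.mean(diffs), statistics.pstdev(diffs)

def detect_backdoor(log_prob, tokens, threshold=0.5, **kwargs):
    """Flag inputs whose perturbation discrepancy stays unusually flat.

    The spread is a simple proxy for the curvature score; `threshold`
    would be chosen per model/dataset in practice.
    """
    _, spread = perturbation_consistency(log_prob, tokens, **kwargs)
    return spread < threshold
```

With a real scorer, `log_prob` would wrap a forward pass of an off-the-shelf pre-trained model; no clean reference samples or access to the poisoned model are needed, matching the zero-shot black-box setting.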