AI Summary
This work proposes the Delayed Backdoor Attack (DBA), a backdoor paradigm that exploits the temporal dimension of pre-trained language models as an underexplored attack surface. Conventional backdoor attacks rely on immediate trigger activation and struggle to use common words as stealthy triggers. DBA instead decouples triggering from activation, so everyday vocabulary can serve as a highly covert trigger. The authors introduce DND, a prototype based on nonlinear decay that integrates a lightweight state-logic module, and they formalize a delayed backdoor model alongside a dual-metric evaluation framework (ASR and ASR_delay). Experiments on four NLP benchmarks show that DND maintains clean accuracy of at least 94% while reaching a post-activation attack success rate of about 99%, compared with an average below 95% for other methods, and that it remains robust against several state-of-the-art defenses.
Abstract
Backdoor attacks against pre-trained models (PTMs) have traditionally operated under an ``immediacy assumption,'' where malicious behavior manifests instantly upon trigger occurrence. This work revisits and challenges this paradigm by introducing \textit{\textbf{Delayed Backdoor Attacks (DBA)}}, a new class of threats in which activation is temporally decoupled from trigger exposure. We propose that this \textbf{temporal dimension} is the key to unlocking a previously infeasible class of attacks: those that use common, everyday words as triggers. To examine the feasibility of this paradigm, we design and implement a proof-of-concept prototype, termed \underline{D}elayed Backdoor Attacks Based on \underline{N}onlinear \underline{D}ecay (DND). DND embeds a lightweight, stateful logic module that postpones activation until a configurable threshold is reached, producing a distinct latency phase followed by a controlled outbreak. We derive a formal model to characterize this latency behavior and propose a dual-metric evaluation framework (ASR and ASR$_{delay}$) to empirically measure the delay effect. Extensive experiments on four natural language processing (NLP) benchmarks validate the core capabilities of DND: it remains dormant for a controllable duration, sustains high clean accuracy ($\ge$94\%), and achieves near-perfect post-activation attack success rates ($\approx$99\%, whereas other methods average below 95\%). Moreover, DND exhibits resilience against several state-of-the-art defenses. This study provides the first empirical evidence that the temporal dimension constitutes a viable yet unprotected attack surface in PTMs, underscoring the need for next-generation, stateful, and time-aware defense mechanisms.
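The abstract names the dual-metric framework (ASR and ASR$_{delay}$) but does not define either metric. The Python sketch below shows one possible reading, purely as an assumption: ASR$_{delay}$ is taken as the attack success rate on triggered inputs during the latency phase, and ASR as the rate after activation. The function names, the `activation_index` parameter, and this split are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def attack_success_rate(preds: Sequence[int], target_label: int) -> float:
    """Fraction of predictions on triggered inputs that equal the attacker's target label."""
    if len(preds) == 0:
        raise ValueError("need at least one prediction")
    return sum(1 for p in preds if p == target_label) / len(preds)


def dual_metric(
    preds: Sequence[int], target_label: int, activation_index: int
) -> Dict[str, float]:
    """Split ordered predictions on triggered inputs into two phases.

    ASSUMPTION (not from the paper): `activation_index` is the position of the
    first query issued after the backdoor's threshold was reached. Queries
    before it form the latency phase, queries from it onward the active phase.
    """
    if not 0 < activation_index < len(preds):
        raise ValueError("activation_index must leave both phases non-empty")
    latent, active = preds[:activation_index], preds[activation_index:]
    return {
        # Assumed reading: a low value here means the backdoor stayed dormant.
        "asr_delay": attack_success_rate(latent, target_label),
        # Assumed reading: post-activation attack success rate.
        "asr": attack_success_rate(active, target_label),
    }


# Toy, made-up predictions: label 1 is the target, activation at position 3.
example = dual_metric([0, 0, 0, 1, 1, 1, 1, 0], target_label=1, activation_index=3)
```

Under this assumed reading, a defender auditing a model would look for a low `asr_delay` followed by a high `asr`, which is the latency-then-outbreak pattern the abstract describes.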