RTD-Guard: A Black-Box Textual Adversarial Detection Framework via Replacement Token Detection

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently detecting textual adversarial examples in black-box settings, where access to model internals, adversarial training, or fine-tuning is unavailable. The authors propose a lightweight defense that leverages a pre-trained Replaced Token Detection (RTD) discriminator, applied here to adversarial detection for the first time, to identify suspicious word substitutions and dynamically mask them. By measuring the shift in prediction confidence across just two black-box queries, the approach effectively flags adversarial inputs. Extensive experiments show that the method consistently outperforms existing baselines across multiple benchmark datasets and a range of state-of-the-art attacks, achieving superior detection performance without any model-specific adaptation or internal information.

📝 Abstract
Textual adversarial attacks pose a serious security threat to Natural Language Processing (NLP) systems by introducing imperceptible perturbations that mislead deep learning models. While adversarial example detection offers a lightweight alternative to robust training, existing methods typically rely on prior knowledge of attacks, white-box access to the victim model, or numerous queries, which severely limits their practical deployment. This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. Our key insight is that word-substitution perturbations in adversarial attacks closely resemble the "replaced tokens" that a Replaced Token Detection (RTD) discriminator is pre-trained to identify. Leveraging this, RTD-Guard employs an off-the-shelf RTD discriminator, without fine-tuning, to localize suspicious tokens, masks them, and detects adversarial examples by observing the prediction confidence shift of the victim model before and after intervention. The entire process requires no adversarial data, model tuning, or internal model access, and uses only two black-box queries. Comprehensive experiments on multiple benchmark datasets demonstrate that RTD-Guard effectively detects adversarial texts generated by diverse state-of-the-art attack methods. It surpasses existing detection baselines across multiple metrics, offering a highly efficient, practical, and resource-light defense mechanism, particularly suited for real-world deployment in resource-constrained or privacy-sensitive environments.
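The detection pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `rtd_guard`, `rtd_score`, `victim_confidence`, and both thresholds are hypothetical names and values. It assumes an RTD discriminator (e.g. an ELECTRA-style model) that returns per-token "replaced" probabilities, and a black-box victim classifier queried exactly twice.

```python
# Hypothetical sketch of the RTD-Guard detection loop.
# Assumed interfaces (not from the paper):
#   rtd_score(tokens)         -> per-token "replaced" probability from an
#                                off-the-shelf RTD discriminator
#   victim_confidence(tokens) -> the black-box victim model's top-class
#                                confidence for the tokenized input
from typing import Callable, List

MASK = "[MASK]"


def rtd_guard(tokens: List[str],
              rtd_score: Callable[[List[str]], List[float]],
              victim_confidence: Callable[[List[str]], float],
              suspicion_threshold: float = 0.5,  # assumed hyperparameter
              shift_threshold: float = 0.2       # assumed hyperparameter
              ) -> bool:
    """Flag an input as adversarial using two black-box victim queries."""
    # 1) Localize suspicious (likely-replaced) tokens with the discriminator.
    scores = rtd_score(tokens)
    masked = [MASK if s > suspicion_threshold else t
              for t, s in zip(tokens, scores)]

    # 2) Query the victim before and after masking (two queries in total).
    conf_before = victim_confidence(tokens)
    conf_after = victim_confidence(masked)

    # 3) A large confidence drop suggests the masked tokens carried the attack.
    return (conf_before - conf_after) > shift_threshold
```

Because the discriminator runs locally and only the two confidence queries touch the victim, the cost per input is fixed regardless of text length, which is the source of the query efficiency the abstract emphasizes.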
Problem

Research questions and friction points this paper is trying to address.

textual adversarial detection
black-box defense
adversarial example
NLP security
model-agnostic detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

black-box detection
Replaced Token Detection (RTD)
textual adversarial examples
adversarial defense
query-efficient
He Zhu
ICT, UCAS
Yanshu Li
Brown University
NLP · Multimodal Learning
Wen Liu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Haitian Yang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China