Text Adversarial Attacks with Dynamic Outputs

📅 2025-09-26

🤖 AI Summary

This paper addresses the challenge of adapting text adversarial attacks to dynamic-output scenarios, in which both the label space and the number of output labels vary. We propose TDOA (Textual Dynamic Outputs Attack), a framework with two key components: (i) a lightweight surrogate model, trained via clustering, that transforms the dynamic multi-output problem into a static single-output one; and (ii) a farthest-label targeted attack strategy designed to improve perturbation transferability and attack strength. To our knowledge, TDOA is the first systematic solution for adversarial attacks on dynamic-output text classification. Evaluated on four datasets and eight victim models with a single query per text, it achieves an attack success rate (ASR) of up to 50.81% in dynamic settings and up to 82.68% in static ones. In generative tasks, TDOA surpasses prior state-of-the-art methods by up to 0.64 RDBLEU and 0.62 RDchrF. We further introduce RDBLEU and RDchrF as evaluation metrics, enabling robustness assessment for dynamic-generation tasks.

📝 Abstract

Text adversarial attack methods are typically designed for static scenarios with a fixed number of output labels and a predefined label space, and they rely on extensive querying of the victim model (query-based attacks) or of a surrogate model (transfer-based attacks). Dynamic-output scenarios, in which the label space and the number of output labels vary, remain unaddressed. To address this gap, we introduce the Textual Dynamic Outputs Attack (TDOA) method, which uses a clustering-based surrogate model training approach to convert the dynamic-output scenario into a static single-output scenario. To improve attack effectiveness, we propose the farthest-label targeted attack strategy, which selects adversarial vectors that deviate most from the model's coarse-grained labels, thereby maximizing disruption. We extensively evaluate TDOA on four datasets and eight victim models (e.g., ChatGPT-4o, ChatGPT-4.1), showing its effectiveness in crafting adversarial examples and its strong potential to compromise large language models with limited access. With a single query per text, TDOA achieves a maximum attack success rate (ASR) of 50.81%. TDOA also achieves state-of-the-art performance in conventional static-output scenarios, reaching a maximum ASR of 82.68%. Finally, by treating translation tasks as classification problems with unbounded output spaces, we extend the TDOA framework to generative settings, surpassing prior results by up to 0.64 RDBLEU and 0.62 RDchrF.
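As a rough illustration of the dynamic-to-static conversion described above, the sketch below clusters fine-grained labels into a fixed set of coarse labels and then assigns each sample a single coarse label. It is not the paper's implementation: the 2-D label embeddings, the plain k-means routine, and the majority-vote rule in `to_single_label` are all assumptions made for this example.

```python
import math
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; centroids are initialised from the first k points."""
    centroids = [list(p) for p in points[:k]]

    def nearest(p):
        # Index of the centroid closest to p (Euclidean distance).
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p) for p in points], centroids

# Hypothetical embeddings for an open-ended set of fine-grained labels.
label_vecs = {
    "sports": (0.9, 0.1),
    "finance": (0.1, 0.9),
    "football": (1.0, 0.0),
    "stocks": (0.0, 1.0),
}
names = list(label_vecs)
assign, centroids = kmeans([label_vecs[n] for n in names], k=2)
coarse = dict(zip(names, assign))  # fine-grained label -> coarse cluster id

def to_single_label(sample_labels, coarse):
    """Collapse a variable-size label set into one coarse label (majority vote, an assumption)."""
    return Counter(coarse[l] for l in sample_labels).most_common(1)[0][0]

# A sample with three labels becomes a single-label training target for the surrogate.
single = to_single_label(["sports", "football", "stocks"], coarse)
```

With a fixed number of clusters, a surrogate classifier can be trained on the coarse targets regardless of how many labels the victim model returns for each text.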

Problem

Research questions and friction points this paper is trying to address.

Attacking text models with dynamic output spaces

Developing a clustering-based method to handle variable labels

Improving adversarial effectiveness with a farthest-label strategy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Clustering-based surrogate model converts dynamic outputs to static

Farthest-label strategy selects maximally disruptive adversarial vectors

Framework extends to generative settings via translation-as-classification conceptualization
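The farthest-label idea listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: "farthest" is taken to mean the largest Euclidean distance between coarse-label centroids, and the candidate texts, their vectors, and the helper names `farthest_label` and `pick_adversarial` are hypothetical.

```python
import math

def farthest_label(predicted, centroids):
    """Return the coarse label whose centroid is farthest from the predicted label's centroid."""
    others = [c for c in centroids if c != predicted]
    return max(others, key=lambda c: math.dist(centroids[c], centroids[predicted]))

def pick_adversarial(candidates, target, centroids):
    """Among (text, vector) candidates, pick the text whose vector lies closest to the target centroid."""
    return min(candidates, key=lambda tv: math.dist(tv[1], centroids[target]))[0]

# Hypothetical coarse-label centroids produced by the surrogate's clustering step.
centroids = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (3.0, 4.0)}

# The surrogate predicts coarse label 0 for the clean text; aim for the farthest label.
target = farthest_label(0, centroids)

# Hypothetical perturbed texts paired with their surrogate representations.
candidates = [
    ("perturbation a", (0.2, 0.1)),
    ("perturbation b", (2.5, 3.5)),
    ("perturbation c", (1.0, 0.2)),
]
chosen = pick_adversarial(candidates, target, centroids)
```

Targeting the farthest coarse label rather than any wrong label is what the source credits with stronger, more transferable perturbations; the distance measure used here is only one plausible way to define "farthest".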
