AI Summary
This paper addresses the challenge of adapting textual adversarial attacks to dynamic-output scenarios, where both the label space and its cardinality can vary. We propose TDOA (Textual Dynamic Outputs Attack), a framework with two key components: (i) a lightweight surrogate model, trained via clustering, that transforms the dynamic multi-output problem into a static single-output one; and (ii) a farthest-label targeted attack strategy designed to improve perturbation transferability and attack strength. To our knowledge, TDOA is the first systematic solution for adversarial attacks on dynamic-output text classification. Evaluated on four datasets and eight target models, it achieves an attack success rate (ASR) of up to 50.81% in dynamic settings with a single query per text, and up to 82.68% in static settings. On generative tasks, TDOA surpasses prior state-of-the-art methods by up to 0.64 RDBLEU and 0.62 RDchrF. We further introduce the RDBLEU/RDchrF evaluation framework, which enables robustness assessment for dynamic-generation tasks.
Abstract
Textual adversarial attack methods are typically designed for static scenarios with a fixed number of output labels and a predefined label space, and they rely on extensive querying of either the victim model (query-based attacks) or a surrogate model (transfer-based attacks). These assumptions break down in dynamic-output scenarios, where the number of output labels and the label space itself can vary. To address this gap, we introduce the Textual Dynamic Outputs Attack (TDOA) method, which employs a clustering-based surrogate model training approach to convert the dynamic-output scenario into a static single-output scenario. To improve attack effectiveness, we propose the farthest-label targeted attack strategy, which selects adversarial vectors that deviate most from the model's coarse-grained labels, thereby maximizing disruption. We extensively evaluate TDOA on four datasets and eight victim models (e.g., ChatGPT-4o, ChatGPT-4.1), showing its effectiveness in crafting adversarial examples and its strong potential to compromise large language models with limited access. With a single query per text, TDOA achieves a maximum attack success rate (ASR) of 50.81%. TDOA also achieves state-of-the-art performance in conventional static-output scenarios, reaching a maximum ASR of 82.68%. Finally, by treating translation tasks as classification problems with unbounded output spaces, we extend the TDOA framework to generative settings, surpassing prior results by up to 0.64 RDBLEU and 0.62 RDchrF.
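The two ideas named above, clustering dynamic outputs into coarse-grained labels and then targeting the farthest label, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D "embeddings", the tiny deterministic k-means, the helper names (`kmeans`, `farthest_label`), and the use of centroid distance to define "farthest" are all assumptions made for illustration only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, iterations=20):
    """Tiny deterministic k-means (assumption: the first k points seed the centroids).

    Returns one coarse-grained label (cluster id) per point and the centroids.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point takes the id of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


def farthest_label(source_label, centroids):
    """Pick the coarse label whose centroid lies farthest from the source label's centroid."""
    candidates = [c for c in range(len(centroids)) if c != source_label]
    return max(
        candidates,
        key=lambda c: squared_distance(centroids[source_label], centroids[c]),
    )


# Hypothetical embeddings of six victim-model outputs (toy 2-D vectors).
output_embeddings = [
    (0.0, 0.0), (5.0, 5.0), (10.0, 0.0),
    (0.2, 0.1), (5.1, 4.9), (9.8, 0.3),
]
coarse_labels, centroids = kmeans(output_embeddings, k=3)
target = farthest_label(coarse_labels[0], centroids)  # attack target for the first text
```

In this sketch, the cluster ids play the role of the static labels a surrogate classifier could be trained on, and `target` is the label a targeted attack on that surrogate would aim for. How TDOA actually embeds outputs, chooses the number of clusters, and measures label distance is not specified in the abstract.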