Bridging the Task Gap: Multi-Task Adversarial Transferability in CLIP and Its Derivatives

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
While CLIP generalizes well on coarse-grained image–text alignment, its performance degrades on fine-grained vision tasks (e.g., object detection, semantic segmentation), and its adversarial robustness and cross-task transferability remain poorly understood. Method: We propose MT-AdvCLIP, a multi-task adversarial framework that systematically investigates how adversarial examples generated on fine-grained tasks transfer to other vision–language tasks. We observe that perturbations crafted on fine-grained tasks exhibit stronger cross-task attack efficacy. The method introduces a task-aware feature aggregation loss over the shared CLIP backbone, jointly covering image–text retrieval, object detection, and semantic segmentation, so that the generated perturbations transfer across tasks. Contribution/Results: On multiple public benchmarks, MT-AdvCLIP improves the average attack success rate across tasks by over 39% without increasing the perturbation budget, substantially strengthening black-box transfer attacks against diverse CLIP-derived models. This offers a new basis for evaluating CLIP's generalization limits and security vulnerabilities.

📝 Abstract
As a general-purpose vision-language pretraining model, CLIP demonstrates strong generalization ability in image-text alignment tasks and has been widely adopted in downstream applications such as image classification and image-text retrieval. However, it struggles with fine-grained tasks such as object detection and semantic segmentation. While many variants aim to improve CLIP on these tasks, its robustness to adversarial perturbations remains underexplored. Understanding how adversarial examples transfer across tasks is key to assessing CLIP's generalization limits and security risks. In this work, we conduct a systematic empirical analysis of the cross-task transfer behavior of CLIP-based models on image-text retrieval, object detection, and semantic segmentation under adversarial perturbations. We find that adversarial examples generated from fine-grained tasks (e.g., object detection and semantic segmentation) often exhibit stronger transfer potential than those from coarse-grained tasks, enabling more effective attacks against the original CLIP model. Motivated by this observation, we propose a novel framework, Multi-Task Adversarial CLIP (MT-AdvCLIP), which introduces a task-aware feature aggregation loss and generates perturbations with enhanced cross-task generalization capability. This design strengthens the attack effectiveness of fine-grained task models on the shared CLIP backbone. Experimental results on multiple public datasets show that MT-AdvCLIP significantly improves the adversarial transfer success rate (the average attack success rate across multiple tasks improves by over 39%) against various CLIP-derived models, without increasing the perturbation budget. This study reveals the transfer mechanism of adversarial examples in multi-task CLIP models, offering new insights into multi-task robustness evaluation and adversarial example design.
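The abstract describes crafting a single perturbation that maximizes an aggregated, task-aware loss over several CLIP-derived heads (retrieval, detection, segmentation) on the shared backbone while staying inside a fixed perturbation budget. The page publishes no code, so the snippet below is only a minimal PGD-style sketch: `models`, `task_losses`, and `weights` are hypothetical per-task wrappers, and the simple weighted sum stands in for the paper's task-aware feature aggregation loss, whose exact form is not given here.

```python
import torch

def mt_adv_perturb(image, models, task_losses, weights,
                   epsilon=8 / 255, alpha=2 / 255, steps=10):
    """Hypothetical multi-task PGD sketch on a shared CLIP backbone.

    `models`, `task_losses`, and `weights` are assumed per-task wrappers
    (retrieval, detection, segmentation) and aggregation weights; the exact
    task-aware feature aggregation loss from the paper is not reproduced here.
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        # Aggregate the per-task losses computed on the shared backbone.
        total = sum(w * loss_fn(model, image + delta)
                    for model, loss_fn, w in zip(models, task_losses, weights))
        total.backward()
        with torch.no_grad():
            # Gradient ascent step, then projection back into the L_inf ball,
            # so the fixed perturbation budget epsilon is never exceeded.
            delta += alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
            delta.copy_((image + delta).clamp(0, 1) - image)
        delta.grad.zero_()
    return (image + delta).detach()
```

The budget projection (clamping delta to [-epsilon, epsilon] and the image to [0, 1]) is what keeps the attack comparable to single-task baselines at the same perturbation budget, matching the abstract's claim that gains come without increasing it.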
Problem

Research questions and friction points this paper is trying to address.

Investigating adversarial example transferability across different vision-language tasks in CLIP models
Addressing CLIP's vulnerability to adversarial attacks from fine-grained tasks like detection and segmentation
Developing a multi-task adversarial framework to enhance cross-task attack generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task adversarial framework (MT-AdvCLIP) for probing the robustness of CLIP and its derivatives
Task-aware feature aggregation loss improves cross-task adversarial transfer
Perturbations boost attack success rates without increasing the perturbation budget (a minimal measurement sketch follows this list)
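The reported gain is an attack success rate against CLIP-derived models under a fixed budget. As a rough illustration of how such a rate could be computed for image-text retrieval, here is a sketch assuming an OpenAI-CLIP-style model exposing `encode_image`/`encode_text` and pre-tokenized captions; the paper's exact evaluation protocol is not specified on this page.

```python
import torch

@torch.no_grad()
def retrieval_attack_success_rate(clip_model, clean_images, adv_images, texts):
    """Count an attack as successful when the adversarial image no longer
    retrieves the same top-1 caption as its clean counterpart.

    `texts` are assumed to be pre-tokenized captions; this is one plausible
    reading of 'attack success rate', not necessarily the paper's protocol.
    """
    text_feat = clip_model.encode_text(texts)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    def top1(images):
        # Rank captions by cosine similarity and take the best match per image.
        img_feat = clip_model.encode_image(images)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        return (img_feat @ text_feat.T).argmax(dim=-1)

    flipped = top1(adv_images) != top1(clean_images)
    return flipped.float().mean().item()
```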
Kuanrong Liu
Shenzhen Campus of Sun Yat-sen University, China
Siyuan Liang
College of Computing and Data Science, Nanyang Technological University
Trustworthy Foundation Model
Cheng Qian
National Key Laboratory of Science and Technology on Information System Security, China
Ming Zhang
National Key Laboratory of Science and Technology on Information System Security, China
Xiaochun Cao
Sun Yat-sen University
Computer Vision · Artificial Intelligence · Multimedia · Machine Learning