One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models

📅 2025-07-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work exposes a critical security vulnerability in unified vision-language models (VLMs): their susceptibility to cross-task adversarial attacks under multi-task instructions. To address this, we introduce CrossVLAD—the first benchmark tailored for object-level manipulation under cross-task settings—featuring GPT-4-assisted annotation and region-feature optimization. We further propose CRAFT, a region-aware token-alignment attack framework that generates instruction-agnostic, multi-task-consistent adversarial perturbations. Our key contributions are threefold: (1) the first systematic evaluation of cross-task adversarial transferability in VLMs; (2) a novel cross-task success rate metric to quantify attack consistency across diverse tasks; and (3) CRAFT’s gradient-coordinated region-token optimization, which significantly improves perturbation consistency. Extensive experiments on state-of-the-art models—including Florence-2—demonstrate that CRAFT achieves substantially higher cross-task attack success rates and targeted object manipulation accuracy compared to existing methods.

📝 Abstract

Unified vision-language models (VLMs) have recently shown remarkable progress, enabling a single model to flexibly address diverse tasks through different instructions within a shared computational architecture. This instruction-based control mechanism creates unique security challenges: an adversarial input must remain effective across multiple task instructions, any of which may be unpredictably applied to the same malicious content. In this paper, we introduce CrossVLAD, a new benchmark dataset carefully curated from MSCOCO with GPT-4-assisted annotations for systematically evaluating cross-task adversarial attacks on unified VLMs. CrossVLAD centers on the object-change objective, that is, consistently manipulating a target object's classification across four downstream tasks, and proposes a novel success rate metric that measures simultaneous misclassification across all tasks, providing a rigorous evaluation of adversarial transferability. To tackle this challenge, we present CRAFT (Cross-task Region-based Attack Framework with Token-alignment), an efficient region-centric attack method. Extensive experiments on Florence-2 and other popular unified VLMs demonstrate that our method outperforms existing approaches in both overall cross-task attack performance and targeted object-change success rates, highlighting its effectiveness in adversarially influencing unified VLMs across diverse tasks.
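The abstract describes the success rate metric only in words: an attack counts as successful on a sample when the target object is misclassified on all tasks at once. A minimal sketch of how such a rate could be computed is given below. The function names, the task labels `task_1` to `task_4`, and the per-sample boolean layout are illustrative assumptions, not the benchmark's actual interface or task names.

```python
from typing import Mapping, Sequence


def cross_task_success_rate(
    results: Sequence[Mapping[str, bool]],
    tasks: Sequence[str],
) -> float:
    """Fraction of samples on which the attack succeeded for every task.

    Each entry of `results` maps a task name to True if the target object
    was changed as intended on that task (hypothetical data layout).
    A missing task entry is treated as a failure.
    """
    if not results:
        return 0.0
    hits = sum(all(r.get(t, False) for t in tasks) for r in results)
    return hits / len(results)


def per_task_success_rate(
    results: Sequence[Mapping[str, bool]],
    task: str,
) -> float:
    """Fraction of samples on which the attack succeeded for one task."""
    if not results:
        return 0.0
    return sum(r.get(task, False) for r in results) / len(results)


# Placeholder task names; the abstract states four tasks but does not list them.
tasks = ["task_1", "task_2", "task_3", "task_4"]

# Toy outcomes for four adversarial images (made-up values for illustration).
results = [
    {"task_1": True, "task_2": True, "task_3": True, "task_4": True},
    {"task_1": True, "task_2": True, "task_3": True, "task_4": False},
    {"task_1": True, "task_2": True, "task_3": True, "task_4": True},
    {"task_1": False, "task_2": False, "task_3": False, "task_4": False},
]

cross_rate = cross_task_success_rate(results, tasks)
task_1_rate = per_task_success_rate(results, "task_1")
```

In this toy data the per-task rate for `task_1` is 0.75 while the simultaneous rate is 0.5. The joint rate can never exceed the lowest per-task rate, which is why it is the stricter measure of consistency across instructions.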

Problem

Research questions and friction points this paper is trying to address.

Evaluating adversarial attacks on unified vision-language models across tasks

Measuring transferability of attacks via multi-task misclassification rates

Developing an efficient region-based framework for cross-task adversarial manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

CrossVLAD benchmark for cross-task adversarial attacks

CRAFT method for region-centric adversarial attacks

Token-alignment for efficient cross-task manipulation

🔎 Similar Papers

Exploring Transferability of Multimodal Adversarial Samples for Vision-Language Pre-training Models with Contrastive Learning