Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes systematic robustness deficiencies in state-of-the-art vision-language models (VLMs) equipped with defensive mechanisms when evaluated in cross-model settings. To probe the vulnerability of input/output filtering to transferable attacks, we propose the Multi-Faceted Attack (MFA) framework: it introduces the Attention-Transfer Attack (ATA), explained through the lens of reward hacking, and combines a lightweight transfer-enhancement algorithm with iterative optimization, yielding highly transferable adversarial perturbations that exploit shared visual representations without model-specific fine-tuning. Our experiments systematically evaluate cross-model fragility in production-grade, defense-equipped VLMs (e.g., GPT-4o, Gemini-Pro). Overall, MFA attains a 58.5% attack success rate; on real-world commercial models it achieves 52.8%, surpassing the second-best attack by 34%. These results reveal fundamental limitations of current defense paradigms.

📝 Abstract
The growing misuse of Vision-Language Models (VLMs) has led providers to deploy multiple safeguards, including alignment tuning, system prompts, and content moderation. However, the real-world robustness of these defenses against adversarial attacks remains underexplored. We introduce Multi-Faceted Attack (MFA), a framework that systematically exposes general safety vulnerabilities in leading defense-equipped VLMs such as GPT-4o, Gemini-Pro, and Llama-4. The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives. We provide a theoretical perspective based on reward hacking to explain why this attack succeeds. To improve cross-model transferability, we further introduce a lightweight transfer-enhancement algorithm combined with a simple repetition strategy that jointly bypasses both input-level and output-level filters without model-specific fine-tuning. Empirically, we show that adversarial images optimized for one vision encoder transfer broadly to unseen VLMs, indicating that shared visual representations create a cross-model safety vulnerability. Overall, MFA achieves a 58.5% success rate and consistently outperforms existing methods. On state-of-the-art commercial models, MFA reaches a 52.8% success rate, surpassing the second-best attack by 34%. These results challenge the perceived robustness of current defense mechanisms and highlight persistent safety weaknesses in modern VLMs. Code: https://github.com/cure-lab/MultiFacetedAttack
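To make the transferability claim concrete, the following is a minimal, hypothetical sketch of how an adversarial image can be optimized in the embedding space of frozen surrogate vision encoders so that it may transfer to unseen VLMs. This is not the authors' released implementation (see the linked repository for that); the embedding-matching objective, the use of an encoder ensemble, and all hyperparameters are assumptions made purely for illustration.

```python
# Minimal sketch (assumed recipe, not the MFA code): PGD-style optimization of an
# adversarial image against an ensemble of frozen surrogate vision encoders,
# illustrating how shared visual representations can make perturbations transfer.
import torch
import torch.nn.functional as F

def craft_transfer_image(image, target_embeds, encoders,
                         eps=8 / 255, alpha=1 / 255, steps=300):
    """image: (1, 3, H, W) tensor in [0, 1].
    target_embeds: one target feature tensor per encoder, each shaped (1, D).
    encoders: frozen surrogate vision encoders mapping images to feature vectors."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = 0.0
        for enc, tgt in zip(encoders, target_embeds):
            feat = enc(torch.clamp(image + delta, 0, 1))
            # Pull the adversarial embedding toward the target embedding
            # (averaging over several surrogates stands in for transfer enhancement).
            loss = loss + (1 - F.cosine_similarity(feat.flatten(1),
                                                   tgt.flatten(1)).mean())
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # signed gradient step on the loss
            delta.clamp_(-eps, eps)              # keep the perturbation small (L-inf ball)
            delta.grad.zero_()
    return torch.clamp(image + delta, 0, 1).detach()
```

In this sketch the surrogates could be, for example, the image towers of openly available CLIP-style encoders; averaging the loss across several of them is a generic stand-in for the paper's lightweight transfer-enhancement step, not a reproduction of its exact algorithm.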
Problem

Research questions and friction points this paper is trying to address.

Exposing cross-model vulnerabilities in defense-equipped vision-language models
Bypassing input-level and output-level filters without model fine-tuning
Demonstrating that shared visual representations create cross-model safety weaknesses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-Transfer Attack hides harmful instructions in meta tasks
Lightweight transfer-enhancement algorithm improves cross-model attack transferability
Adversarial images exploit shared visual representations across VLMs
Yijun Yang
The Chinese University of Hong Kong
Lichao Wang
Beijing Institute of Technology
Jianping Zhang
The Chinese University of Hong Kong
Chi Harold Liu
Professor, Vice Dean, Fellow of IET and BCS, Beijing Institute of Technology
IoT, Mobile Crowd Sensing, UAV Crowdsensing, Embodied AI, Deep Reinforcement Learning
Lanqing Hong
Huawei Noah’s Ark Lab
Qiang Xu
The Chinese University of Hong Kong