🤖 AI Summary
This work exposes systematic robustness deficiencies in state-of-the-art vision-language models (VLMs) equipped with defensive mechanisms when they are evaluated in cross-model settings. To exploit the vulnerability of input/output filtering to transferable attacks, we propose the Multi-Faceted Attack (MFA) framework: it introduces the Attention-Transfer Attack (ATA), explained through the lens of reward hacking, and combines a lightweight transfer-enhancement algorithm with a simple repetition strategy, producing highly transferable adversarial perturbations that exploit shared visual representations without model-specific fine-tuning. We systematically evaluate mainstream production-grade, defense-equipped VLMs (e.g., GPT-4o, Gemini-Pro) for cross-model fragility. Overall, MFA attains a 58.5% attack success rate, and on real-world commercial models it achieves 52.8%, surpassing the second-best attack by 34%. These results reveal fundamental limitations of current defense paradigms.
📝 Abstract
The growing misuse of Vision-Language Models (VLMs) has led providers to deploy multiple safeguards, including alignment tuning, system prompts, and content moderation. However, the real-world robustness of these defenses against adversarial attacks remains underexplored. We introduce Multi-Faceted Attack (MFA), a framework that systematically exposes general safety vulnerabilities in leading defense-equipped VLMs such as GPT-4o, Gemini-Pro, and Llama-4. The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives. We provide a theoretical perspective based on reward hacking to explain why this attack succeeds. To improve cross-model transferability, we further introduce a lightweight transfer-enhancement algorithm combined with a simple repetition strategy that jointly bypasses both input-level and output-level filters without model-specific fine-tuning. Empirically, we show that adversarial images optimized for one vision encoder transfer broadly to unseen VLMs, indicating that shared visual representations create a cross-model safety vulnerability. Overall, MFA achieves a 58.5% success rate and consistently outperforms existing methods. On state-of-the-art commercial models, MFA reaches a 52.8% success rate, surpassing the second-best attack by 34%. These results challenge the perceived robustness of current defense mechanisms and highlight persistent safety weaknesses in modern VLMs. Code: https://github.com/cure-lab/MultiFacetedAttack
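The abstract's claim that adversarial images crafted against one vision encoder transfer to unseen VLMs rests on a standard surrogate-encoder optimization loop. The sketch below is only an illustration of that generic idea, not the authors' MFA/ATA implementation: it assumes an open-source CLIP ViT-B/32 surrogate loaded via `open_clip`, and uses a PGD-style embedding-matching objective (push a carrier image's embedding toward a target image's embedding under an L-infinity budget); the function name and hyperparameters are hypothetical.

```python
# Illustrative sketch only: embedding-matching perturbation against a surrogate
# vision encoder (open_clip ViT-B/32). This is NOT the paper's MFA/ATA code; it
# merely shows why perturbations crafted on a shared visual representation can
# transfer to VLMs built on similar encoders.
import torch
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
model = model.to(device).eval()


def craft_transfer_perturbation(src_img, tgt_img, eps=8 / 255, alpha=1 / 255, steps=200):
    """PGD-style loop: nudge the carrier image's CLIP embedding toward the
    target image's embedding while keeping the perturbation within `eps`."""
    src = preprocess(src_img).unsqueeze(0).to(device)   # clean carrier image (PIL input)
    tgt = preprocess(tgt_img).unsqueeze(0).to(device)   # image whose semantics we inject
    with torch.no_grad():
        tgt_feat = torch.nn.functional.normalize(model.encode_image(tgt), dim=-1)

    delta = torch.zeros_like(src, requires_grad=True)
    for _ in range(steps):
        adv_feat = torch.nn.functional.normalize(
            model.encode_image(src + delta), dim=-1)
        loss = -(adv_feat * tgt_feat).sum()              # negative cosine similarity
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()           # signed gradient step
            delta.clamp_(-eps, eps)                      # project back into the budget
            delta.grad.zero_()
    return (src + delta).detach()
```

In this toy setup the perturbed image is then fed to a different, unseen VLM; to the extent that model's vision tower shares representations with the surrogate, the injected semantics carry over, which is the cross-model vulnerability the paper highlights.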