Understanding and Enhancing Encoder-based Adversarial Transferability against Large Vision-Language Models

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing encoder-based adversarial attacks exhibit poor transferability across large vision-language models (LVLMs), particularly under black-box settings. This work systematically identifies the root causes for the first time: inconsistent visual focus across models and redundant semantic alignment within individual models. To address these issues, we propose the Semantic-Guided Multimodal Attack (SGMA) framework, which leverages attention guidance to concentrate perturbations on visually critical regions while simultaneously disrupting cross-modal alignment. Extensive experiments across multiple mainstream LVLMs and tasks demonstrate that SGMA significantly improves the transfer success rate of adversarial examples, thereby exposing critical security vulnerabilities in real-world deployments of LVLMs.

📝 Abstract
Large vision-language models (LVLMs) have achieved impressive success across multimodal tasks, but their reliance on visual inputs exposes them to significant adversarial threats. Existing encoder-based attacks perturb the input image by optimizing solely on the vision encoder, rather than the entire LVLM, offering a computationally efficient alternative to end-to-end optimization. However, their transferability across different LVLM architectures in realistic black-box scenarios remains poorly understood. To address this gap, we present the first systematic study towards encoder-based adversarial transferability in LVLMs. Our contributions are threefold. First, through large-scale benchmarking over eight diverse LVLMs, we reveal that existing attacks exhibit severely limited transferability. Second, we perform in-depth analysis, disclosing two root causes that hinder the transferability: (1) inconsistent visual grounding across models, where different models focus their attention on distinct regions; (2) redundant semantic alignment within models, where a single object is dispersed across multiple overlapping token representations. Third, we propose Semantic-Guided Multimodal Attack (SGMA), a novel framework to enhance the transferability. Inspired by the discovered causes in our analysis, SGMA directs perturbations toward semantically critical regions and disrupts cross-modal grounding at both global and local levels. Extensive experiments across different victim models and tasks show that SGMA achieves higher transferability than existing attacks. These results expose critical security risks in LVLM deployment and underscore the urgent need for robust multimodal defenses.
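For context on the class of attacks the abstract describes: encoder-based attacks optimize an image perturbation against the vision encoder's embedding alone, rather than the full LVLM. The sketch below is illustrative only, not the paper's SGMA method: it runs sign-gradient PGD to push an input's embedding away from its clean embedding under an L-infinity budget, with a toy linear map `W` standing in for a real vision encoder (e.g., CLIP's); all names and parameters here are hypothetical.

```python
import numpy as np

def pgd_encoder_attack(x_clean, W, eps=0.03, alpha=0.01, steps=20, seed=0):
    """Toy encoder-based attack: maximize ||W x_adv - W x_clean||^2
    (embedding deviation of a stand-in linear 'encoder' W) with PGD,
    keeping the perturbation inside an L_inf ball of radius eps."""
    rng = np.random.default_rng(seed)
    # Random start inside the eps-ball so the gradient is nonzero at step 1.
    x_adv = np.clip(x_clean + rng.uniform(-eps, eps, x_clean.shape), 0.0, 1.0)
    z_clean = W @ x_clean
    for _ in range(steps):
        # Gradient of ||W x_adv - z_clean||^2 with respect to x_adv.
        grad = 2.0 * W.T @ (W @ x_adv - z_clean)
        x_adv = x_adv + alpha * np.sign(grad)                 # ascent step
        x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)  # project to budget
        x_adv = np.clip(x_adv, 0.0, 1.0)                      # keep valid pixel range
    return x_adv

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))        # hypothetical stand-in for a vision encoder
x = rng.uniform(0.2, 0.8, size=16)  # hypothetical stand-in for a clean image
x_adv = pgd_encoder_attack(x, W)
```

A real attack in this family would replace `W` with a deep vision encoder and use autograd for the gradient; the projection and sign-step structure stay the same. The paper's point is that perturbations crafted this way against one surrogate encoder transfer poorly to other LVLMs, which SGMA aims to fix.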
Problem

Research questions and friction points this paper is trying to address.

adversarial transferability
large vision-language models
encoder-based attacks
black-box scenarios
visual grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

adversarial transferability
vision-language models
encoder-based attack
semantic grounding
multimodal alignment
Xinwei Zhang
The Hong Kong Polytechnic University, Hong Kong, China
Li Bai
The Hong Kong Polytechnic University, Hong Kong, China
Tianwei Zhang
Nanyang Technological University
Computer System Security
Youqian Zhang
The Hong Kong Polytechnic University, Hong Kong, China
Qingqing Ye
Assistant Professor, The Hong Kong Polytechnic University
Data privacy and security, adversarial machine learning
Yingnan Zhao
Harbin Engineering University, Harbin, China
Ruochen Du
Harbin Engineering University, Harbin, China
Haibo Hu
Professor, Hong Kong Polytechnic University
Data privacy and security, adversarial machine learning, mobile and spatiotemporal databases