Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

📅 2026-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of membership inference attacks against vision-language models under realistic black-box API settings, where existing methods rely on internal logits or large-scale statistical distributions and thus fail in single-sample scenarios. The authors propose a novel attack leveraging discrepancies in cross-modal semantic alignment, introducing for the first time the alignment strength between an image and its generated text in the joint embedding space as a discriminative signal. This approach requires neither access to internal model information nor extensive sample statistics. Evaluated under strict black-box, single-sample conditions, the method achieves an AUC of 0.821 against LLaVA-1.5 on the VL-MIA/Flickr dataset, substantially outperforming current baselines, and demonstrates robustness across various image perturbations.
📝 Abstract
Vision-Language Models (VLMs) have achieved remarkable success, yet their reliance on massive datasets and their unintended memorization of training data raise significant data security risks. Membership Inference Attacks (MIAs) aim to assess these risks by determining whether a data sample was included in a model's training set. However, existing MIA methods against VLMs face critical bottlenecks: gray-box methods rely on internal logits, which real-world Application Programming Interfaces (APIs) typically do not expose, while black-box methods depend on large-scale statistical distributions and therefore struggle in single-sample scenarios. To this end, we investigate MIAs from the perspective of cross-modal semantic alignment and observe that member images exhibit significantly stronger image-caption alignment due to training memorization, whereas generated captions for non-members may deviate from the original visual content. Leveraging this insight, we propose a novel MIA framework designed for the strict black-box, single-sample setting that quantifies this alignment within a joint embedding space, thereby avoiding the unrealistic assumptions of logit access and large sample pools. We conducted extensive experiments on three open-source and two closed-source VLMs. On the VL-MIA/Flickr dataset, our method achieves an AUC of 0.821 against LLaVA-1.5, significantly outperforming existing baselines. Furthermore, it remains robust under diverse image perturbations, highlighting its practicality.
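The alignment signal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the image and the VLM-generated caption have already been embedded into a shared space by some joint encoder (for example a CLIP-style model), that alignment is measured by cosine similarity, and that a fixed threshold decides membership. The function names `alignment_score`, `infer_membership`, and `auc`, the threshold value, and the toy vectors are all hypothetical.

```python
import numpy as np


def alignment_score(image_emb, caption_emb):
    """Cosine similarity between an image embedding and a caption embedding.

    Assumption: both vectors come from the same joint embedding space.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    caption_emb = np.asarray(caption_emb, dtype=float)
    denom = np.linalg.norm(image_emb) * np.linalg.norm(caption_emb)
    if denom == 0.0:
        raise ValueError("embeddings must be non-zero")
    return float(image_emb @ caption_emb / denom)


def infer_membership(image_emb, caption_emb, threshold=0.3):
    """Flag one sample as a likely training member if alignment exceeds the threshold.

    The threshold is a placeholder; in practice it would have to be calibrated.
    """
    return alignment_score(image_emb, caption_emb) > threshold


def auc(member_scores, nonmember_scores):
    """AUC as the probability that a random member outscores a random non-member.

    Ties count as one half, matching the usual rank-based definition.
    """
    m = np.asarray(member_scores, dtype=float)[:, None]
    n = np.asarray(nonmember_scores, dtype=float)[None, :]
    wins = (m > n).sum() + 0.5 * (m == n).sum()
    return float(wins / (m.size * n.size))


if __name__ == "__main__":
    # Toy embeddings standing in for real encoder outputs.
    image = [0.2, 0.9, 0.1]
    close_caption = [0.25, 0.85, 0.05]   # caption that tracks the image closely
    drifting_caption = [0.9, -0.1, 0.4]  # caption that deviates from the image
    print(alignment_score(image, close_caption), infer_membership(image, close_caption))
    print(alignment_score(image, drifting_caption), infer_membership(image, drifting_caption))
```

Because the decision uses only the image and the caption returned by the model, the sketch needs neither logits nor a pool of reference samples, which is the property the abstract emphasizes. The `auc` helper shows how a set of such scores would be summarized into a figure comparable to the reported 0.821, given labeled member and non-member scores.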
Problem

Research questions and friction points this paper is trying to address.

Membership Inference Attack
Vision-Language Models
Black-Box
Single-Sample
Cross-modal Semantic Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Membership Inference Attack
Vision-Language Models
Cross-modal Semantic Alignment
Black-box Attack
Single-sample Inference