Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of membership inference attacks against vision-language models under realistic black-box API settings, where existing methods rely on internal logits or large-scale statistical distributions and thus fail in single-sample scenarios. The authors propose a novel attack leveraging discrepancies in cross-modal semantic alignment, introducing for the first time the alignment strength between an image and its generated text in the joint embedding space as a discriminative signal. This approach requires neither access to internal model information nor extensive sample statistics. Evaluated under strict black-box, single-sample conditions, the method achieves an AUC of 0.821 against LLaVA-1.5 on the VL-MIA/Flickr dataset, substantially outperforming current baselines, and demonstrates robustness across various image perturbations.
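For readers unfamiliar with the metric, the AUC reported above can be read as the probability that a randomly chosen member sample receives a higher attack score than a randomly chosen non-member. The sketch below computes it by direct pairwise comparison. The function name `auc` and the toy scores are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def auc(member_scores: Sequence[float], non_member_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (member, non-member) pairs in which the member
    has the higher score; ties count as half a win.
    """
    if not member_scores or not non_member_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


if __name__ == "__main__":
    # Toy scores (made up for illustration): 3 of 4 pairs are ranked correctly.
    print(auc([0.9, 0.3], [0.5, 0.1]))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of members from non-members.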

📝 Abstract

Vision-Language Models (VLMs) have achieved remarkable success, yet their reliance on massive datasets and their unintended memorization of training data raise significant data security risks. Membership Inference Attacks (MIAs) aim to assess these risks by determining whether a data sample was included in a model's training set. However, existing MIA methods against VLMs face critical bottlenecks: gray-box methods rely on internal logits that are typically inaccessible through real-world Application Programming Interfaces (APIs), while black-box methods depend on large-scale statistical distributions and therefore struggle in single-sample scenarios. To this end, we investigate MIAs from the perspective of cross-modal semantic alignment and observe that member images exhibit significantly stronger image-caption alignment due to training memorization, whereas generated captions for non-members may deviate from the original visual content. Leveraging this insight, we propose a novel MIA framework designed for strict black-box, single-sample settings that quantifies such alignment within a joint embedding space, thereby bypassing these unrealistic assumptions. We conducted extensive experiments on three open-source and two closed-source VLMs. On the VL-MIA/Flickr dataset, our method achieves an AUC of 0.821 against LLaVA-1.5, significantly outperforming existing baselines. Furthermore, it remains robust under diverse image perturbations, highlighting its practicality.
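The core idea of scoring image-caption alignment in a joint embedding space can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the image and the VLM-generated caption have already been embedded into the same space by some joint encoder (the abstract does not name one), it uses cosine similarity as the alignment measure, and the names `cosine_similarity`, `predict_member`, and the threshold value are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def predict_member(
    image_emb: Sequence[float],
    caption_emb: Sequence[float],
    threshold: float = 0.3,  # hypothetical value; would need calibration
) -> bool:
    """Flag a sample as a likely training member when the image embedding and
    the embedding of the caption generated for it are strongly aligned."""
    return cosine_similarity(image_emb, caption_emb) >= threshold


if __name__ == "__main__":
    # Toy 2-D embeddings, made up for illustration only.
    image = [1.0, 0.0]
    faithful_caption = [1.0, 0.0]   # caption stays close to the image content
    drifting_caption = [0.0, 1.0]   # caption deviates from the image content
    print(predict_member(image, faithful_caption))
    print(predict_member(image, drifting_caption))
```

Because the decision uses only the input image and the text returned by the model, a score of this kind needs neither logits nor a reference population of samples, which is what makes the single-sample black-box setting feasible.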

Problem

Research questions and friction points this paper is trying to address.

Membership Inference Attack

Vision-Language Models

Black-Box

Single-Sample

Cross-modal Semantic Alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Membership Inference Attack

Vision-Language Models

Cross-modal Semantic Alignment