Revisiting Data Auditing in Large Vision-Language Models

📅 2025-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Membership inference (MI) auditing of large vision-language models (VLMs) lacks empirical validity in realistic settings: existing MI benchmarks suffer from distribution shift between member and non-member images, which introduces shortcut cues that inflate reported performance. Method: The authors construct the first unbiased i.i.d. MI benchmark for VLMs, propose an optimal transport–based metric to quantify the distribution discrepancy, and probe the Bayes-optimal (irreducible) error of MI within the VLM embedding space. Contribution/Results: State-of-the-art MI methods degrade to roughly 50% AUC (i.e., random guessing) on the unbiased benchmark, and the irreducible error remains high, exposing severe limits on real-world auditability; the authors nonetheless identify three practically viable auditing scenarios: fine-tuning, access to ground-truth texts, and set-based inference. The work establishes fundamental limits for data membership auditing and offers empirically grounded guidance for trustworthy data governance.
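To make the "≈50% AUC means random guessing" claim concrete, here is a minimal sketch (not the paper's code; the score distributions are simulated) showing that when member and non-member samples yield scores from the same distribution, as on an i.i.d. benchmark, the AUC of any fixed membership score collapses to about 0.5, whereas a distribution-shift shortcut cue pushes it well above chance:

```python
import numpy as np

def auc(member_scores, nonmember_scores):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly drawn member scores higher than a random non-member."""
    m = np.asarray(member_scores)[:, None]
    n = np.asarray(nonmember_scores)[None, :]
    return np.mean(m > n) + 0.5 * np.mean(m == n)

rng = np.random.default_rng(1)

# i.i.d. benchmark: member and non-member scores share one distribution,
# so no attack statistic can separate them better than chance.
iid_auc = auc(rng.normal(size=5000), rng.normal(size=5000))

# Biased benchmark: a shortcut cue shifts member scores upward,
# inflating the apparent MI performance.
shifted_auc = auc(rng.normal(loc=1.0, size=5000), rng.normal(size=5000))
```

Here `iid_auc` lands near 0.5 while `shifted_auc` is substantially higher, illustrating how benchmark bias alone can manufacture an apparently strong auditor.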

📝 Abstract
With the surge of large language models (LLMs), Large Vision-Language Models (VLMs), which integrate vision encoders with LLMs for accurate visual grounding, have shown great potential in tasks like generalist agents and robotic control. However, VLMs are typically trained on massive web-scraped images, raising concerns over copyright infringement and privacy violations, and making data auditing increasingly urgent. Membership inference (MI), which determines whether a sample was used in training, has emerged as a key auditing technique, with promising results on open-source VLMs like LLaVA (AUC>80%). In this work, we revisit these advances and uncover a critical issue: current MI benchmarks suffer from distribution shifts between member and non-member images, introducing shortcut cues that inflate MI performance. We further analyze the nature of these shifts and propose a principled metric based on optimal transport to quantify the distribution discrepancy. To evaluate MI in realistic settings, we construct new benchmarks with i.i.d. member and non-member images. Existing MI methods fail under these unbiased conditions, performing only marginally better than chance. Further, we explore the theoretical upper bound of MI by probing the Bayes Optimality within the VLM's embedding space and find the irreducible error rate remains high. Despite this pessimistic outlook, we analyze why MI for VLMs is particularly challenging and identify three practical scenarios (fine-tuning, access to ground-truth texts, and set-based inference) where auditing becomes feasible. Our study presents a systematic view of the limits and opportunities of MI for VLMs, providing guidance for future efforts in trustworthy data auditing.
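The paper's exact optimal-transport metric is not reproduced here, but the idea of quantifying member/non-member distribution discrepancy can be sketched with the empirical 1-Wasserstein distance between two equal-size sets of image embeddings, computed exactly via an optimal assignment (the function name, embedding dimensions, and random data are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_wasserstein(member_emb, nonmember_emb):
    """Exact 1-Wasserstein distance between two equal-size empirical
    distributions of embeddings: build the pairwise Euclidean cost
    matrix, solve the optimal assignment, and average matched costs."""
    diff = member_emb[:, None, :] - nonmember_emb[None, :, :]
    cost = np.linalg.norm(diff, axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

rng = np.random.default_rng(0)

# Unbiased (i.i.d.) setting: both sets drawn from the same distribution.
same = empirical_wasserstein(rng.normal(size=(64, 16)),
                             rng.normal(size=(64, 16)))

# Biased setting: non-members come from a shifted distribution,
# mimicking the member/non-member distribution shift the paper flags.
shifted = empirical_wasserstein(rng.normal(size=(64, 16)),
                                rng.normal(loc=2.0, size=(64, 16)))
```

A markedly larger divergence on a benchmark's member vs. non-member pools (as in `shifted` vs. `same` above) signals that reported MI performance may ride on the shift rather than on true membership signal.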
Problem

Research questions and friction points this paper is trying to address.

Assessing data auditing challenges in Vision-Language Models (VLMs)
Identifying distribution shifts in membership inference benchmarks
Exploring feasible scenarios for effective VLM data auditing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes optimal transport metric for distribution discrepancy
Constructs unbiased i.i.d. benchmarks for MI evaluation
Identifies three feasible scenarios for VLM auditing