Vid-SME: Membership Inference Attacks against Large Video Understanding Models

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing membership inference attacks (MIAs) struggle to model the temporal dynamics and frame-count sensitivity of video data, yielding extremely low true positive rates (TPRs) at low false positive rates (FPRs) and leaving privacy leakage during large video foundation model training effectively undetectable. To address this, we propose the first video-specific MIA framework. Our method introduces a Sharma-Mittal entropy (SME)-based, video-specific metric that computes robust membership scores from SME discrepancies between natural and time-reversed videos. We further design an adaptive parameterization mechanism to mitigate the performance degradation induced by frame-count expansion, and integrate multi-frame temporal-sensitivity modeling. Evaluated across diverse proprietary and open-source video foundation models, our approach achieves an average 3.2× improvement in TPR@Low FPR over prior methods, marking the first demonstration of high-accuracy, scalable membership inference in the video domain.
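The Sharma-Mittal entropy at the core of the method is a two-parameter family that generalizes the Shannon, Rényi, and Tsallis entropies. A minimal sketch of the quantity over a model's output distribution (the parameter names `q` and `r` follow one common convention; the paper's exact parameter choices are not given in this summary):

```python
import numpy as np

def sharma_mittal_entropy(p: np.ndarray, q: float, r: float) -> float:
    """Sharma-Mittal entropy H_{q,r}(p) = ((sum_i p_i^q)^((1-r)/(1-q)) - 1) / (1 - r).

    Recovers Renyi entropy as r -> 1, Tsallis as r -> q, and Shannon as q, r -> 1.
    Assumes q != 1 and r != 1 (the limiting cases need separate handling).
    """
    p = np.asarray(p, dtype=float)
    s = np.sum(p ** q)  # generalized "partition" term over the distribution
    return (s ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

# Sanity check: uniform distribution over 4 outcomes with q=2, r=0.5 gives
# H = (4^(1-r) - 1) / (1 - r) = (2 - 1) / 0.5 = 2.0
print(sharma_mittal_entropy(np.full(4, 0.25), q=2.0, r=0.5))  # → 2.0
```

Tuning `(q, r)` reshapes how sharply the entropy rewards confident (peaked) output distributions, which is what the adaptive parameterization below exploits as the frame count grows.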

📝 Abstract
Multimodal large language models (MLLMs) demonstrate remarkable capabilities in handling complex multimodal tasks and are increasingly adopted in video understanding applications. However, their rapid advancement raises serious data privacy concerns, particularly given the potential inclusion of sensitive video content, such as personal recordings and surveillance footage, in their training datasets. Determining improperly used videos during training remains a critical and unresolved challenge. Despite considerable progress on membership inference attacks (MIAs) for text and image data in MLLMs, existing methods fail to generalize effectively to the video domain. These methods suffer from poor scalability as more frames are sampled and generally achieve negligible true positive rates at low false positive rates (TPR@Low FPR), mainly due to their failure to capture the inherent temporal variations of video frames and to account for model behavior differences as the number of frames varies. To address these challenges, we introduce Vid-SME, the first membership inference method tailored for video data used in video understanding LLMs (VULLMs). Vid-SME leverages the confidence of model output and integrates adaptive parameterization to compute Sharma-Mittal entropy (SME) for video inputs. By leveraging the SME difference between natural and temporally-reversed video frames, Vid-SME derives robust membership scores to determine whether a given video is part of the model's training set. Experiments on various self-trained and open-sourced VULLMs demonstrate the strong effectiveness of Vid-SME.
Problem

Research questions and friction points this paper is trying to address.

Addressing privacy risks in video understanding models
Detecting improperly used training videos effectively
Overcoming limitations of existing inference attack methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive parameterization for Sharma-Mittal entropy
Leverages confidence of model output
Compares natural and reversed video frames
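The bullets above can be combined into a hedged sketch of the scoring idea: average the SME of the model's per-frame output-confidence distributions for the natural and time-reversed frame orders, and use the gap as the membership score (a training-set video is expected to yield lower entropy in its natural order). The `adaptive_params` schedule below is a placeholder tied to frame count; the paper's actual adaptive parameterization is not specified in this summary.

```python
import numpy as np

def sharma_mittal_entropy(p: np.ndarray, q: float, r: float) -> float:
    """H_{q,r}(p) = ((sum_i p_i^q)^((1-r)/(1-q)) - 1) / (1 - r), with q != 1, r != 1."""
    s = np.sum(np.asarray(p, dtype=float) ** q)
    return (s ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

def adaptive_params(n_frames: int) -> tuple[float, float]:
    # Placeholder schedule: pull q toward 1 as more frames are sampled,
    # standing in for the paper's adaptive parameterization mechanism.
    return 1.0 + 1.0 / n_frames, 0.5

def membership_score(probs_natural, probs_reversed) -> float:
    """SME gap between time-reversed and natural frame orders.

    `probs_*` are per-frame output-confidence distributions (e.g. next-token
    probabilities); a larger positive score suggests membership.
    """
    q, r = adaptive_params(len(probs_natural))
    h_nat = np.mean([sharma_mittal_entropy(p, q, r) for p in probs_natural])
    h_rev = np.mean([sharma_mittal_entropy(p, q, r) for p in probs_reversed])
    return h_rev - h_nat

# Toy check: a video the model is confident on in natural order but not in
# reversed order should score positive (member-like behavior).
confident = np.array([0.97, 0.01, 0.01, 0.01])
uniform = np.full(4, 0.25)
print(membership_score([confident] * 8, [uniform] * 8) > 0)  # → True
```

Thresholding this score then yields the member/non-member decision; sweeping the threshold traces the TPR@Low FPR curve the paper reports on.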