🤖 AI Summary
To address the risk of sensitive-attribute leakage (e.g., identity, gender) from latent visual features in video foundation models, this paper proposes the first general-purpose anonymization paradigm tailored to video latent spaces. The method requires no retraining of the backbone; instead, it introduces a lightweight, plug-and-play anonymization adapter applied atop a frozen video encoder. Leveraging a self-supervised privacy constraint, joint task optimization, and a latent consistency loss, the framework achieves end-to-end feature sanitization. Evaluated on benchmarks including Kinetics400, UCF101, HMDB51, THUMOS14, and UCF-Crime, it maintains near-baseline performance across diverse downstream tasks (action recognition, temporal action detection, and anomaly detection), with accuracy drops under 1.2%. Privacy leakage is reduced by 35%, and gender-classification bias is significantly mitigated. The approach thus delivers strong privacy protection, high task utility, and improved model fairness without retraining the backbone or re-extracting features.
📝 Abstract
We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. While the spatio-temporal features learned by foundation models have deepened general understanding of video content, sharing or storing these extracted features for downstream tasks inadvertently reveals sensitive personal information such as skin color, gender, or clothing. Existing privacy-preservation methods focus on input-pixel-level anonymization; they require retraining the entire utility video model and yield task-specific anonymization, making them unsuitable for recent video foundation models. To address these challenges, we introduce a lightweight Anonymizing Adapter Module (AAM) that removes private information from video features while retaining general task utility. AAM can be applied in a plug-and-play fashion to frozen video encoders, minimizing the computational burden of finetuning and re-extracting features. Our framework employs three newly designed training objectives: (1) a clip-level self-supervised privacy objective that reduces mutual information between static clips, (2) a co-training objective that retains utility on seen tasks, and (3) a latent consistency loss for generalization to unseen tasks. Our extensive evaluations demonstrate a significant 35% reduction in privacy leakage while maintaining near-baseline utility across diverse downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). We also provide an analysis of anonymization for sensitive temporal attribute recognition. Additionally, we propose new protocols for assessing gender bias in action recognition models, showing that our method effectively mitigates such biases and promotes more equitable video understanding.
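To make the setup described above concrete, the sketch below illustrates the general shape of a plug-and-play anonymizing adapter applied on top of frozen encoder features, trained with a weighted sum of the three stated objectives (privacy, utility co-training, latent consistency). All names, shapes, and loss weights here are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

# Minimal sketch, assuming a frozen video encoder that outputs fixed-size
# latent features. The adapter, its dimensions, and the loss weights are
# hypothetical; the paper's AAM and objectives are only summarized here.

rng = np.random.default_rng(0)
FEAT_DIM = 768  # assumed latent dimension of the frozen encoder

# Stand-in for features of a batch of 4 clips from the frozen backbone.
frozen_features = rng.standard_normal((4, FEAT_DIM))

# Lightweight adapter: one learnable linear map in residual form, so the
# identity mapping is easy to recover and downstream heads stay unchanged.
W = 0.01 * rng.standard_normal((FEAT_DIM, FEAT_DIM))

def anonymize(features, W):
    """Apply the adapter on top of frozen features (residual form)."""
    return features + features @ W

def combined_loss(l_privacy, l_utility, l_consistency,
                  w_priv=1.0, w_util=1.0, w_cons=0.5):
    """Weighted sum of the three objectives named in the abstract:
    (1) self-supervised privacy, (2) utility co-training on seen tasks,
    (3) latent consistency for unseen tasks. Weights are illustrative."""
    return w_priv * l_privacy + w_util * l_utility + w_cons * l_consistency

sanitized = anonymize(frozen_features, W)
print(sanitized.shape)  # same shape as the input features: (4, 768)
```

Because the adapter preserves the feature dimensionality, the sanitized features can be dropped into any downstream head that consumed the original frozen features, which is the plug-and-play property the abstract emphasizes.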