🤖 AI Summary
Prior work directly adapts general-purpose vision-language models (VLMs) for image privacy classification but lacks systematic, fair zero-shot comparisons against specialized models, obscuring performance ceilings and modality-specific advantages.
Method: We introduce the first zero-shot image privacy classification benchmark and propose task-aligned prompting strategies to rigorously evaluate three leading open-source VLMs against lightweight specialized vision models.
Contribution/Results: Our evaluation spans accuracy, inference efficiency, and robustness to adversarial and natural perturbations (e.g., distortion, compression, occlusion). Results show that despite their larger parameter counts and slower inference, VLMs underperform specialized models in accuracy, yet their cross-modal representations confer significantly greater robustness to common image corruptions. This work clarifies the unique value and practical limits of multimodal models in privacy-sensitive applications.
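The natural perturbations named above (distortion, compression, occlusion) can be simulated with standard image operations. A minimal sketch using Pillow, assuming illustrative corruption parameters (the paper's exact settings are not given here):

```python
# Hedged sketch of natural image perturbations used for robustness
# probing. Function names and severity values are illustrative
# assumptions, not the paper's exact protocol.
import io
from PIL import Image, ImageFilter

def jpeg_compress(img: Image.Image, quality: int = 10) -> Image.Image:
    """Re-encode at low JPEG quality to simulate compression artifacts."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def gaussian_blur(img: Image.Image, radius: int = 3) -> Image.Image:
    """Blur the image to simulate distortion / defocus."""
    return img.filter(ImageFilter.GaussianBlur(radius))

def occlude(img: Image.Image, frac: float = 0.25) -> Image.Image:
    """Black out a central square covering `frac` of each dimension."""
    out = img.copy()
    w, h = out.size
    bw, bh = int(w * frac), int(h * frac)
    x0, y0 = (w - bw) // 2, (h - bh) // 2
    out.paste((0, 0, 0), (x0, y0, x0 + bw, y0 + bh))
    return out
```

Each perturbed copy of a test image is then classified by every model, and accuracy under corruption is compared to clean-image accuracy.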
📄 Abstract
While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. Without systematic evaluation, this trend risks overlooking the performance ceiling set by purpose-built models. To address this problem, we establish a zero-shot benchmark for image privacy classification that enables a fair comparison. We evaluate the three top-ranked open-source VLMs on a privacy benchmark using task-aligned prompts, and contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature (high parameter counts and slower inference), currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.
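A task-aligned prompt of the kind described above might look as follows. This is a hedged sketch: the prompt wording, the binary label set, and the `parse_label` fallback are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical task-aligned zero-shot prompt for binary image privacy
# classification; the wording and parsing rule are assumptions.
PRIVACY_PROMPT = (
    "You are auditing images for privacy risk. An image is PRIVATE if "
    "sharing it publicly could expose personal information (faces, "
    "identity documents, home interiors, medical or financial content); "
    "otherwise it is PUBLIC. Answer with exactly one word: "
    "'private' or 'public'."
)

def parse_label(response: str) -> str:
    """Map a free-form VLM response to a binary label, defaulting to
    'public' when the answer is ambiguous."""
    return "private" if "private" in response.strip().lower() else "public"

# A VLM call would then look like:
#   label = parse_label(vlm.generate(image, PRIVACY_PROMPT))
```

Constraining the output space to two tokens makes zero-shot predictions directly comparable with the binary outputs of specialized classifiers.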