🤖 AI Summary
This work addresses a critical gap in current vision-language models (VLMs): their inability to reason about contextual privacy norms in image geolocation, which often leads to unintended disclosure of sensitive location information that contradicts user intent. Rather than advocating for the blanket disabling of geolocation capabilities, the authors propose a privacy disclosure evaluation framework grounded in social norms and context-aware reasoning, emphasizing the need for VLMs to develop situated privacy judgment. To this end, they introduce VLM-GEOPRIVACY, a benchmark combining real-world images with human privacy judgments to systematically assess contextual sensitivity. Experiments on 14 state-of-the-art VLMs reveal that, despite high geolocation accuracy, these models consistently fail to align with human privacy expectations and remain vulnerable to prompt-based attacks that trigger excessive disclosure.
📝 Abstract
Vision-language models (VLMs) have demonstrated strong performance in image geolocation, a capability further sharpened by frontier multimodal large reasoning models (MLRMs). This poses a significant privacy risk: these widely accessible models can be exploited to infer sensitive locations from casually shared photos, often at street-level precision, potentially surpassing the level of detail the sharer consented or intended to disclose. While recent work has proposed blanket restrictions on geolocation disclosure to combat this risk, such measures fail to distinguish valid geolocation uses from malicious behavior. Instead, VLMs should maintain contextual integrity by reasoning about elements within an image to determine the appropriate level of information disclosure, balancing privacy and utility. To evaluate how well models respect contextual integrity, we introduce VLM-GEOPRIVACY, a benchmark that challenges VLMs to interpret latent social norms and contextual cues in real-world images and determine the appropriate level of location disclosure. Our evaluation of 14 leading VLMs shows that, despite their ability to precisely geolocate images, the models are poorly aligned with human privacy expectations: they often over-disclose in sensitive contexts and are vulnerable to prompt-based attacks. Our results call for new design principles that incorporate context-conditioned privacy reasoning into multimodal systems.