How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in current vision-language models (VLMs): they lack embodied privacy awareness in physical environments, while existing evaluation benchmarks remain confined to text and fail to capture real-world complexity. To bridge this gap, we introduce ImmersedPrivacy, the first multimodal, context-aware privacy evaluation framework grounded in the physical world, built on a Unity-based immersive audiovisual simulation environment. We systematically assess twelve state-of-the-art VLMs across three dimensions: sensitive object recognition, adaptation to social contexts, and resolution of instruction–privacy conflicts. Our experiments show that all models degrade as scene complexity increases, that no model exceeds 65% accuracy under shifting social contexts, and that even the best-performing model balances task execution with privacy preservation in only 51% of conflict scenarios. These results expose systemic deficiencies in current VLMs’ capacity for real-world privacy-sensitive decision-making.

📝 Abstract

As Vision-Language Models (VLMs) are increasingly deployed as autonomous cognitive cores for embodied assistants, evaluating their privacy awareness in physical environments becomes critical. Unlike digital chatbots, these agents operate in intimate spaces, such as homes and hospitals, where they possess the physical agency to observe and manipulate privacy-sensitive information and artifacts. However, current benchmarks remain limited to unimodal, text-based representations that cannot capture the demands of real-world settings. To bridge this gap, we present ImmersedPrivacy, an interactive audio-visual evaluation framework that simulates realistic physical environments using a Unity-based simulator. ImmersedPrivacy evaluates physically grounded privacy awareness across three progressive tiers that test a model's ability to identify sensitive items in cluttered scenes, adapt to shifting social contexts, and resolve conflicts between explicit commands and inferred privacy constraints. Our evaluation of 12 state-of-the-art models reveals consistent deficits. In cluttered scenes, all models exhibit monotonic performance decay as scene complexity grows, owing to perceptual deficits. When social context shifts, no model exceeds 65% selection accuracy. Under conflicting commands, the best model, gemini-3.1-pro, perfectly balances task completion and privacy preservation in only 51% of cases. These findings reveal that current VLMs in the physical world suffer from perceptual fragility and fail to let their knowledge of privacy cues govern their situated behavior. Our code and data are available at https://github.com/immersed-privacy/immersed-privacy .

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

Privacy Awareness

Physical World

Embodied Agents

Social Context

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models

Privacy Awareness

Embodied AI