🤖 AI Summary
This work systematically exposes, for the first time, privacy vulnerabilities in multimodal large language models (MLLMs), specifically personally identifiable information (PII) leakage from vision-language models (VLMs). To address this, we propose a fine-tuning-free, concept-guided privacy protection method: it localizes and modifies PII-related internal representations via concept vectors and integrates a task-rejection mechanism to dynamically suppress sensitive outputs during inference. We further construct several realistic, application-aligned multimodal PII benchmark datasets. Experiments demonstrate that our method achieves an average PII-task rejection rate of 93.3% while preserving near-original performance on non-sensitive tasks, significantly outperforming existing baselines. The approach is effective, generalizes across diverse VLMs and PII types, and is deployment-friendly due to its inference-time, parameter-efficient design.
📝 Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in processing and reasoning over diverse modalities, but their advanced abilities also raise significant privacy concerns, particularly regarding Personally Identifiable Information (PII) leakage. While such risks have been studied to some extent in single-modal language models, the vulnerabilities in the multimodal setting have yet to be fully investigated. In this work, we investigate these emerging risks with a focus on vision-language models (VLMs), a representative subclass of MLLMs that covers the two modalities most relevant to PII leakage: vision and text. We introduce a concept-guided mitigation approach that identifies and modifies the model's internal states associated with PII-related content. Our method guides VLMs to refuse PII-sensitive tasks effectively and efficiently, without requiring re-training or fine-tuning. We also address the current lack of multimodal PII datasets by constructing several that simulate real-world scenarios. Experimental results demonstrate that the method achieves an average refusal rate of 93.3% across various PII-related tasks with minimal impact on unrelated model performance. We further examine the mitigation's behavior under various conditions to show the adaptability of the proposed method.
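The abstract does not detail how the concept vectors are obtained or applied. One common way such concept-guided editing of internal states is realized is with difference-of-means concept vectors: average the hidden activations for sensitive versus benign prompts, take the difference as the concept direction, then at inference detect and project out that direction to drive refusal. The sketch below illustrates this idea on synthetic activations only; the dimensionality, the `guard` function, the threshold, and the shift magnitude are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # hidden-state dimensionality (illustrative)

# Synthetic setup: PII-related prompts share a common activation direction.
pii_direction = rng.normal(size=D)
pii_direction /= np.linalg.norm(pii_direction)

def fake_hidden_state(is_pii: bool) -> np.ndarray:
    """Stand-in for one VLM layer activation for a single prompt."""
    h = rng.normal(size=D)
    if is_pii:
        h += 6.0 * pii_direction  # PII content shifts the activation
    return h

# 1) Build a concept vector from contrastive examples:
#    mean(PII activations) - mean(benign activations), then normalize.
pii_acts = np.stack([fake_hidden_state(True) for _ in range(200)])
benign_acts = np.stack([fake_hidden_state(False) for _ in range(200)])
concept = pii_acts.mean(axis=0) - benign_acts.mean(axis=0)
concept /= np.linalg.norm(concept)

# 2) At inference, score a new activation against the concept vector;
#    above a threshold, remove the concept component (one simple way to
#    "modify internal states") and route the request to task rejection.
THRESHOLD = 3.0  # would be tuned on held-out data in practice (assumption)

def guard(h: np.ndarray):
    score = float(h @ concept)
    if score > THRESHOLD:
        h_edited = h - score * concept  # project out the PII component
        return h_edited, True           # True -> trigger refusal
    return h, False

h_pii, rejected_pii = guard(fake_hidden_state(True))
h_ok, rejected_ok = guard(fake_hidden_state(False))
```

Because this intervention only reads and edits activations at inference time, it needs no gradient updates, which matches the fine-tuning-free, parameter-efficient framing in the abstract.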