🤖 AI Summary
This work systematically exposes, for the first time, privacy vulnerabilities in multimodal large language models (MLLMs), specifically personally identifiable information (PII) leakage from vision-language models (VLMs). To address this, we propose a fine-tuning-free, concept-guided privacy protection method: it localizes and modifies PII-related internal representations via concept vectors and integrates a task-rejection mechanism to dynamically suppress sensitive outputs during inference. We further construct several realistic, application-aligned multimodal PII benchmark datasets. Experiments demonstrate that our method achieves an average PII-task rejection rate of 93.3% while preserving near-original performance on non-sensitive tasks, significantly outperforming existing baselines. The approach is effective, generalizes across diverse VLMs and PII types, and is deployment-friendly due to its inference-time, parameter-efficient design.
📝 Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in processing and reasoning over diverse modalities, but their advanced abilities also raise significant privacy concerns, particularly regarding Personally Identifiable Information (PII) leakage. While such risks have been studied to some extent in single-modal language models, the vulnerabilities in the multimodal setting have yet to be fully investigated. In this work, we investigate these emerging risks with a focus on vision-language models (VLMs), a representative subclass of MLLMs that covers the two modalities most relevant to PII leakage: vision and text. We introduce a concept-guided mitigation approach that identifies and modifies the model's internal states associated with PII-related content. Our method guides VLMs to refuse PII-sensitive tasks effectively and efficiently, without requiring re-training or fine-tuning. We also address the current lack of multimodal PII datasets by constructing several that simulate real-world scenarios. Experimental results demonstrate that the method achieves an average refusal rate of 93.3% across various PII-related tasks with minimal impact on unrelated model performance. We further examine the mitigation's behavior under various conditions to show the adaptability of the proposed method.
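The abstract does not detail how the concept vectors are obtained or applied. One common way such concept-guided editing of internal states is realized is with difference-of-means concept vectors: average the hidden activations for sensitive versus benign prompts, take the difference as the concept direction, then at inference detect and project out that direction to drive refusal. The sketch below illustrates this idea on synthetic activations only; the dimensionality, the `guard` function, the threshold, and the shift magnitude are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # hidden-state dimensionality (illustrative)

# Synthetic setup: PII-related prompts share a common activation direction.
pii_direction = rng.normal(size=D)
pii_direction /= np.linalg.norm(pii_direction)

def fake_hidden_state(is_pii: bool) -> np.ndarray:
    """Stand-in for one VLM layer activation for a single prompt."""
    h = rng.normal(size=D)
    if is_pii:
        h += 6.0 * pii_direction  # PII content shifts the activation
    return h

# 1) Build a concept vector from contrastive examples:
#    mean(PII activations) - mean(benign activations), then normalize.
pii_acts = np.stack([fake_hidden_state(True) for _ in range(200)])
benign_acts = np.stack([fake_hidden_state(False) for _ in range(200)])
concept = pii_acts.mean(axis=0) - benign_acts.mean(axis=0)
concept /= np.linalg.norm(concept)

# 2) At inference, score a new activation against the concept vector;
#    above a threshold, remove the concept component (one simple way to
#    "modify internal states") and route the request to task rejection.
THRESHOLD = 3.0  # would be tuned on held-out data in practice (assumption)

def guard(h: np.ndarray):
    score = float(h @ concept)
    if score > THRESHOLD:
        h_edited = h - score * concept  # project out the PII component
        return h_edited, True           # True -> trigger refusal
    return h, False

h_pii, rejected_pii = guard(fake_hidden_state(True))
h_ok, rejected_ok = guard(fake_hidden_state(False))
```

Because this intervention only reads and edits activations at inference time, it needs no gradient updates, which matches the fine-tuning-free, parameter-efficient framing in the abstract.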