Defeating Cerberus: Concept-Guided Privacy-Leakage Mitigation in Multimodal Language Models

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work presents the first systematic study of privacy vulnerabilities in multimodal large language models (MLLMs), focusing on personally identifiable information (PII) leakage from vision-language models (VLMs). To address these risks, the authors propose a fine-tuning-free, concept-guided privacy-protection method: it localizes and modifies PII-related internal representations via concept vectors and integrates a task-rejection mechanism to dynamically suppress sensitive outputs during inference. The work also constructs several realistic, application-aligned multimodal PII benchmark datasets. Experiments show that the method achieves an average PII-task rejection rate of 93.3% while preserving near-original performance on non-sensitive tasks, significantly outperforming existing baselines. The approach is effective, generalizes across diverse VLMs and PII types, and is deployment-friendly thanks to its inference-time, parameter-efficient design.

📝 Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in processing and reasoning over diverse modalities, but their advanced abilities also raise significant privacy concerns, particularly regarding Personally Identifiable Information (PII) leakage. While relevant research has been conducted on single-modal language models to some extent, the vulnerabilities in the multimodal setting have yet to be fully investigated. In this work, we investigate these emerging risks with a focus on vision-language models (VLMs), a representative subclass of MLLMs that covers the two modalities most relevant for PII leakage: vision and text. We introduce a concept-guided mitigation approach that identifies and modifies the model's internal states associated with PII-related content. Our method guides VLMs to refuse PII-sensitive tasks effectively and efficiently, without requiring re-training or fine-tuning. We also address the current lack of multimodal PII datasets by constructing various ones that simulate real-world scenarios. Experimental results demonstrate that the method can achieve an average refusal rate of 93.3% for various PII-related tasks with minimal impact on unrelated model performance. We further examine the mitigation's performance under various conditions to show the adaptability of our proposed method.
Problem

Research questions and friction points this paper is trying to address.

Mitigating PII leakage risks in multimodal language models
Addressing privacy vulnerabilities in vision-language model systems
Developing training-free privacy protection for sensitive information tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Concept-guided mitigation that modifies the model's internal states
Refusal of PII-sensitive tasks without retraining or fine-tuning
Average refusal rate of 93.3% on PII-related tasks, with minimal impact on unrelated tasks
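The core idea behind the bullets above, finding a PII-related direction in the model's hidden states and steering activations toward refusal at inference time, can be illustrated with a small NumPy sketch. Note that everything here is an illustrative assumption rather than the paper's actual implementation: the difference-of-means concept vector, the `detect_and_steer` function, and the `threshold` and `alpha` parameters are all hypothetical stand-ins for the concept-guided mechanism the paper describes.

```python
import numpy as np

def concept_vector(pii_acts: np.ndarray, benign_acts: np.ndarray) -> np.ndarray:
    """Estimate a unit-norm 'PII concept' direction as the difference of mean
    activations between PII-related and benign prompts (a common, simple way
    to extract a concept direction; assumed here, not taken from the paper)."""
    v = pii_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def detect_and_steer(h: np.ndarray, v: np.ndarray,
                     threshold: float = 0.5, alpha: float = 2.0):
    """If a hidden state projects strongly onto the PII concept direction,
    shift it along that direction to push the model toward refusal.
    Returns the (possibly modified) state and whether steering fired."""
    score = float(h @ v)
    if score > threshold:
        return h + alpha * v, True
    return h, False

# Toy demonstration with synthetic 'activations': PII prompts differ from
# benign ones along one latent direction.
rng = np.random.default_rng(0)
dim = 16
true_direction = np.zeros(dim)
true_direction[0] = 1.0
benign = rng.normal(0.0, 0.1, size=(50, dim))
pii = rng.normal(0.0, 0.1, size=(50, dim)) + true_direction

v = concept_vector(pii, benign)
_, fired_on_pii = detect_and_steer(pii[0], v)
_, fired_on_benign = detect_and_steer(benign[0], v)
```

In a real VLM this check-and-shift would run inside the forward pass (e.g. via an activation hook on a chosen layer) so that steered states propagate to the output and elicit a refusal; the NumPy version only shows the geometry of the intervention.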
Authors

Boyang Zhang (CISPA Helmholtz Center for Information Security)
Istemi Ekin Akkus (Nokia Bell Labs)
Ruichuan Chen (Nokia Bell Labs)
Alice Dethise (Nokia Bell Labs)
Klaus Satzke (Nokia Bell Labs)
Ivica Rimac (Nokia Bell Labs)
Yang Zhang (CISPA Helmholtz Center for Information Security)