🤖 AI Summary
Multimodal large language models (MLLMs) often inadvertently memorize sensitive information, such as personally identifiable data or harmful content, during training, and adversaries can exploit multimodal prompts to extract that knowledge. Yet systematic evaluation of multimodal forgetting has been lacking.
Method: We introduce UnLOK-VQA, the first benchmark for targeted forgetting of sensitive knowledge in MLLMs, built on high-quality, human-curated image-text pairs, together with an attack-and-defense framework: seven multimodal extraction attacks (four white-box, three black-box), including a novel white-box attack that leverages the interpretability of hidden states, evaluated against six defense objectives.
Contribution/Results: We find that the most effective defense erases answer-related information from the model's hidden states, and that larger models are more robust after forgetting. Multimodal attacks significantly outperform text-only and image-only ones. Together, these results establish a framework for research on secure forgetting in MLLMs.
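The headline defense deletes answer information from the model's internal representations. The sketch below illustrates one simple way to realize that idea: project hidden states onto the orthogonal complement of an "answer direction." This is a minimal illustration, not the paper's exact procedure; the `answer_direction` vector (e.g., derived from the answer's token embeddings) and the choice of layer are assumptions.

```python
import torch

def erase_answer_direction(hidden: torch.Tensor,
                           answer_direction: torch.Tensor) -> torch.Tensor:
    """Project hidden states [batch, seq, d_model] onto the orthogonal
    complement of a unit-normalized answer direction [d_model]."""
    u = answer_direction / answer_direction.norm()
    coeff = hidden @ u  # [batch, seq]: component along the answer direction
    return hidden - coeff.unsqueeze(-1) * u

def register_erasure_hook(layer: torch.nn.Module,
                          answer_direction: torch.Tensor):
    """Attach a forward hook that strips the (assumed) answer direction
    from the layer's output at inference time."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = erase_answer_direction(h, answer_direction.to(h.device))
        return (h,) + tuple(output[1:]) if isinstance(output, tuple) else h
    return layer.register_forward_hook(hook)
```

Projection-based erasure is only one possible instantiation; the paper's defense objectives are trained, but the intuition (the answer should not be linearly recoverable from internal states) is the same.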
📝 Abstract
LLMs trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs (MLLMs), which integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset using an automated pipeline that generates samples at varying proximity to the deleted knowledge, for testing generalization and specificity, followed by manual filtering to maintain high quality. We then evaluate six defense objectives against seven attacks (four white-box, three black-box), including a novel white-box attack that leverages the interpretability of hidden states. Our results show that multimodal attacks outperform text-only or image-only ones, and that the most effective defense removes answer information from the model's internal states. Additionally, larger models exhibit greater post-editing robustness, suggesting that scale enhances safety. UnLOK-VQA thus provides a rigorous benchmark for advancing unlearning research in MLLMs.
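To make the attack-and-defense evaluation concrete, here is a hedged sketch of how a black-box extraction attack might be scored: after unlearning, the model is probed with the original and rephrased multimodal prompts, and an attack counts as successful when the deleted answer resurfaces. `query_model` and the substring-based leak check are illustrative assumptions, not the benchmark's actual API or metric.

```python
from typing import Any, Callable, List

def attack_success_rate(query_model: Callable[[str, Any], str],
                        image: Any,
                        question: str,
                        deleted_answer: str,
                        paraphrases: List[str]) -> float:
    """Fraction of multimodal prompts whose response leaks the deleted answer."""
    prompts = [question] + paraphrases  # original question plus rephrasings
    leaks = sum(deleted_answer.lower() in query_model(p, image).lower()
                for p in prompts)
    return leaks / len(prompts)
```

A lower rate under such probing indicates a more robust defense; white-box attacks would additionally inspect logits or hidden states rather than only generated text.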