🤖 AI Summary
This work asks whether existing machine unlearning methods genuinely erase undesirable memorized content from vision-language models (VLMs) or merely suppress its surface manifestation. The authors present the first systematic survey and robustness analysis of VLM unlearning, organized around a taxonomy of existing methods and a unified evaluation under multiple prompt settings. They introduce three attack paradigms (contextual prompting, multi-prompt composition, and downstream fine-tuning) to probe whether supposedly forgotten knowledge can be reactivated. Extensive experiments show that many existing unlearning methods remain vulnerable to these attacks, indicating that they often hide rather than fully remove the target knowledge. The findings highlight the need for more reliable multimodal unlearning strategies.
📝 Abstract
Vision-language models (VLMs) may memorize undesirable information from training data, motivating growing interest in machine unlearning. In this work, we present the first systematic survey and robustness analysis of VLM unlearning. We provide a comprehensive taxonomy and review of existing VLM unlearning methods, together with unified evaluations under multiple prompt settings. We then propose three attack paradigms to examine whether forgotten multimodal knowledge can be reactivated through contextual prompting or downstream retraining. Extensive experiments show that many existing methods remain vulnerable under these attacks, indicating that current approaches often hide rather than fully remove target knowledge. Our study provides new insights into the robustness and limitations of current VLM unlearning methods and highlights the need for more reliable multimodal unlearning strategies. Code is available at https://github.com/XMUDeepLIT/VLM-UnL-Attack.