🤖 AI Summary
Current evaluations of unlearning in large vision-language models (LVLMs) are often unreliable because they do not verify that the model memorized the target information in the first place. To address this limitation, this work introduces ReMem, a benchmark that builds a robust memorization-and-unlearning evaluation framework through principled data scaling, reasoning-aware question-answer pairs, and multi-image visual contexts. It also proposes a new Exposure metric that quantifies the depth of information erasure at the level of the model's internal probability distribution. Experiments show that ReMem diagnoses under-memorization during the initial learning stage, making unlearning assessments more rigorous and trustworthy.
📝 Abstract
While Large Vision-Language Models (LVLMs) offer powerful capabilities, they pose privacy risks by unintentionally memorizing sensitive personal information. Current unlearning benchmarks address this by using fictitious identities, but they overlook a critical first-stage failure: models fail to memorize the target information effectively in the first place, which renders subsequent unlearning evaluations unreliable. We diagnose under-memorization and the multi-hop curse as the root causes and introduce ReMem, a Reliable Multi-hop and Multi-image Memorization Benchmark. ReMem ensures robust foundational learning through principled data scaling, reasoning-aware QA pairs, and diverse visual contexts. Additionally, we propose a novel Exposure metric that quantifies how deeply information has been erased from the model's internal probability distribution. Extensive experiments demonstrate that ReMem provides a rigorous and trustworthy framework for diagnosing both learning and unlearning behaviors in LVLMs.