AI Summary
Existing evaluation metrics behave inconsistently in multimodal machine unlearning tasks, making it difficult to assess unlearning efficacy reliably. This work systematically analyzes the conflicting rankings produced by five widely used metrics across three visual question answering (VQA) benchmarks and proposes a Unified Quality Score (UQS), which weights each metric by its Spearman correlation with the distance to an oracle model retrained only on the retain set. Empirical evaluation on 36 unlearned LLaVA-1.5-7B models, with a replication on BLIP-2 OPT-2.7B, reveals substantial discrepancies in metric-induced rankings. UQS rankings remain stable under 100 random weight perturbations (Kendall's τ = 0.647 ± 0.262). The authors publicly release the benchmark suite, model checkpoints, and an interactive leaderboard to support reproducible research in multimodal unlearning.
Abstract
Machine unlearning in Vision-Language Models (VLMs) is required for compliance with the General Data Protection Regulation (GDPR), yet current evaluation practices are inconsistent. We present the first systematic study of metric reliability in multimodal unlearning. Five standard metrics, Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and Jensen-Shannon divergence (JS), yield conflicting method rankings across three VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench). Kendall tau analysis over 36 unlearned LLaVA-1.5-7B models reveals two opposing clusters, {FA, RA, MIA} and {AD, JS}, with tau_FA_AD = -0.26, reproduced on BLIP-2 OPT-2.7B. Agreement is lower in multimodal VQA (average tau = 0.086) than in unimodal classification (average tau = 0.158; difference = 0.072), indicating that dual image-and-text pathways amplify inconsistency. We introduce the Unified Quality Score (UQS), a composite metric with weights derived from each metric's Spearman correlation with the oracle distance d(M_hat, M_star), where M_star is the oracle model retrained only on the retain set. RA shows the strongest reliability (rho = 0.484, p = 0.003), while FA is negatively correlated (rho = -0.418, p = 0.011). UQS yields stable rankings under 100 random weight perturbations (tau = 0.647 ± 0.262). We release the benchmark, 36 checkpoints, and an interactive leaderboard. Code and pre-computed results are available at https://github.com/neurips26/UnifiedUnl.
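The UQS construction and its perturbation test can be illustrated with a minimal, dependency-free Python sketch. It is not the released implementation. The abstract does not give the exact weighting formula, the sign convention, or the perturbation scheme, so the following are assumptions: each weight is the metric's Spearman correlation with closeness to the oracle (the negated distance), divided by the sum of absolute correlations; perturbations are multiplicative Gaussian noise on the weights; and the function names (`uqs_weights`, `uqs`, `stability`) and toy scores are hypothetical.

```python
import random
from itertools import combinations


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(x)), 2))
    s = 0
    for i, j in pairs:
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    return s / len(pairs)


def uqs_weights(metrics, oracle_distance):
    """ASSUMED scheme: w_k = rho_k / sum_j |rho_j|, rho vs. negated distance."""
    closeness = [-d for d in oracle_distance]
    rho = {k: spearman(v, closeness) for k, v in metrics.items()}
    total = sum(abs(r) for r in rho.values())
    return {k: r / total for k, r in rho.items()}


def uqs(metrics, weights):
    """Weighted sum of metric scores, one value per model."""
    n = len(next(iter(metrics.values())))
    return [sum(weights[k] * metrics[k][i] for k in metrics) for i in range(n)]


def stability(metrics, weights, trials=100, sigma=0.1, seed=0):
    """Mean and std of Kendall tau between base and perturbed UQS rankings."""
    rng = random.Random(seed)
    base = uqs(metrics, weights)
    taus = []
    for _ in range(trials):
        # ASSUMED perturbation: multiplicative Gaussian noise, then renormalize.
        noisy = {k: w * (1 + rng.gauss(0, sigma)) for k, w in weights.items()}
        total = sum(abs(w) for w in noisy.values())
        noisy = {k: w / total for k, w in noisy.items()}
        taus.append(kendall_tau(base, uqs(metrics, noisy)))
    mean = sum(taus) / trials
    std = (sum((t - mean) ** 2 for t in taus) / trials) ** 0.5
    return mean, std


# Toy, made-up scores for five hypothetical models (not results from the paper).
toy_metrics = {
    "RA": [0.9, 0.8, 0.7, 0.6, 0.5],
    "FA": [0.2, 0.1, 0.4, 0.3, 0.5],
}
toy_distance = [0.1, 0.2, 0.3, 0.4, 0.5]

toy_weights = uqs_weights(toy_metrics, toy_distance)
mean_tau, std_tau = stability(toy_metrics, toy_weights)
print(toy_weights, mean_tau, std_tau)
```

Under this convention, a metric that falls as a model approaches the oracle receives a negative weight, so it lowers the composite score instead of being discarded. Real metrics would also need to be brought to a common scale before weighting, a step the sketch omits.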