Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluation metrics behave inconsistently in multimodal machine unlearning, making it difficult to assess unlearning efficacy reliably. This work systematically analyzes the conflicting rankings produced by five widely used metrics across three visual question answering (VQA) benchmarks and proposes a Unified Quality Score (UQS), which weights each metric by its Spearman correlation with the distance to an oracle model retrained only on the retain set. Evaluation on 36 unlearned LLaVA-1.5-7B models, with the clustering pattern reproduced on BLIP-2 OPT-2.7B, reveals substantial discrepancies in metric-induced rankings. Under 100 random weight perturbations, UQS rankings reach a Kendall’s τ of 0.647 ± 0.262. The authors release the benchmark, the 36 checkpoints, and an interactive leaderboard to support reproducible research in multimodal unlearning.

📝 Abstract

Machine unlearning in Vision-Language Models (VLMs) is required for compliance with the General Data Protection Regulation (GDPR), yet current evaluation practices are inconsistent. We present the first systematic study of metric reliability in multimodal unlearning. Five standard metrics, Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS), yield conflicting method rankings across three VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench). Kendall's τ analysis over 36 unlearned LLaVA-1.5-7B models reveals two opposing clusters, {FA, RA, MIA} and {AD, JS}, with τ(FA, AD) = -0.26, reproduced on BLIP-2 OPT-2.7B. Agreement is lower in multimodal VQA (average τ = 0.086) than in unimodal classification (average τ = 0.158; difference = 0.072), indicating that dual image-and-text pathways amplify inconsistency. We introduce the Unified Quality Score (UQS), a composite metric with weights derived from each metric's Spearman correlation with the oracle distance d(M_hat, M_star), where M_star is the oracle model retrained only on the retain set. RA shows the strongest reliability (ρ = 0.484, p = 0.003), while FA is negatively correlated (ρ = -0.418, p = 0.011). UQS yields stable rankings under 100 random weight perturbations (τ = 0.647 ± 0.262). We release the benchmark, 36 checkpoints, and an interactive leaderboard. Code and pre-computed results are available at https://github.com/neurips26/UnifiedUnl.
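
The pipeline described in the abstract (rank agreement via Kendall's τ, Spearman-derived weights, and a weight-perturbation stability check) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the exact weighting formula, so the rule used here (signed Spearman ρ, normalized by the sum of absolute values) is an assumption, as are the Gaussian perturbation noise, the tau-a variant of Kendall's τ, and every function name. Metric values are assumed to be pre-scaled to a common range.

```python
import random
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def uqs_weights(metrics, oracle_distance):
    """ASSUMED rule: signed rho with the oracle distance, L1-normalized."""
    rho = {m: spearman(v, oracle_distance) for m, v in metrics.items()}
    total = sum(abs(r) for r in rho.values())
    return {m: r / total for m, r in rho.items()}


def uqs(metrics, weights):
    """Weighted sum of metric values, one score per model."""
    n = len(next(iter(metrics.values())))
    return [sum(weights[m] * metrics[m][i] for m in metrics) for i in range(n)]


def perturbation_stability(metrics, weights, trials=100, scale=0.1, seed=0):
    """Mean and std of tau between the base ranking and noisy-weight rankings."""
    rng = random.Random(seed)
    base = uqs(metrics, weights)
    taus = []
    for _ in range(trials):
        # ASSUMPTION: additive Gaussian noise on each weight.
        noisy = {m: w + rng.gauss(0, scale) for m, w in weights.items()}
        taus.append(kendall_tau(base, uqs(metrics, noisy)))
    mean = sum(taus) / trials
    std = (sum((t - mean) ** 2 for t in taus) / trials) ** 0.5
    return mean, std
```

Here `metrics` would map each metric name (for example "RA" or "FA") to a list of per-model values, and `oracle_distance` would hold d(M_hat, M_star) for the same models in the same order. With signed weights, the direction of the resulting score follows the orientation of the distance, so whether higher or lower is better depends on a convention the abstract does not specify.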

Problem

Research questions and friction points this paper is trying to address.

Machine Unlearning

Multimodal Models

Evaluation Metrics

Metric Reliability

Vision-Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine Unlearning

Multimodal Evaluation

Metric Reliability