🤖 AI Summary
Existing interpretability methods struggle to efficiently and uniformly explain local predictions of multimodal models, often constrained by unimodal assumptions, architectural dependencies, or high computational costs. This work proposes a model-agnostic local explanation framework that constructs a modality-aware surrogate model with group-structured sparsity to jointly disentangle modality-level contributions and feature-level attributions. By leveraging only a few forward passes and integrating modality-aware optimization with group-sparse regularization, the method delivers faithful, black-box-compatible explanations at low computational cost. Evaluated on vision–language question answering and clinical prediction tasks, it reduces black-box queries by 35–67× and runtime by 2–8× compared to strong baselines, while maintaining competitive deletion fidelity.
📝 Abstract
Multimodal models are ubiquitous, yet existing explainability methods are often single-modal, architecture-dependent, or too computationally expensive to run at scale. We introduce LEMON (Local Explanations via Modality-aware OptimizatioN), a model-agnostic framework for local explanations of multimodal predictions. LEMON fits a single modality-aware surrogate with group-structured sparsity to produce unified explanations that disentangle modality-level contributions and feature-level attributions. The approach treats the predictor as a black box and is computationally efficient, requiring relatively few forward passes while remaining faithful under repeated perturbations. We evaluate LEMON on vision–language question answering and a clinical prediction task with image, text, and tabular inputs, comparing against representative multimodal baselines. Across backbones, LEMON achieves competitive deletion-based faithfulness while reducing black-box evaluations by 35–67× and runtime by 2–8× compared to strong multimodal baselines.
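The core idea, a perturbation-based linear surrogate with a group-lasso penalty whose groups correspond to modalities, can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual implementation: the function names (`fit_group_sparse_surrogate`, `predict_fn`), the binary masking scheme, and the plain ISTA solver are all assumptions for the sake of the example.

```python
import numpy as np

def group_soft_threshold(w, groups, thresh):
    """Block soft-thresholding: proximal operator of the group-lasso penalty.

    Shrinks each modality's coefficient block toward zero; a whole block is
    zeroed when its norm falls below the (size-scaled) threshold.
    """
    out = w.copy()
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        scale = max(0.0, 1.0 - thresh * np.sqrt(len(idx)) / (norm + 1e-12))
        out[idx] = scale * w[idx]
    return out

def fit_group_sparse_surrogate(predict_fn, d, groups,
                               n_samples=200, lam=0.05, n_iter=500, seed=0):
    """Fit a group-sparse linear surrogate to a black-box predictor.

    predict_fn : maps a binary mask z in {0,1}^d (which features are kept)
                 to the black-box output for the perturbed input.
    groups     : list of index arrays, one per modality (image / text / tabular).
    Returns per-feature weights; the norm of each group is that modality's
    contribution, individual entries are feature-level attributions.
    """
    rng = np.random.default_rng(seed)
    # Sample perturbation masks and query the black box (the only model access).
    Z = rng.integers(0, 2, size=(n_samples, d)).astype(float)
    y = np.array([predict_fn(z) for z in Z])
    # Proximal gradient (ISTA) on 1/(2n)||Zw - y||^2 + lam * group penalty.
    w = np.zeros(d)
    step = 1.0 / (np.linalg.norm(Z, 2) ** 2 / n_samples + 1e-12)
    for _ in range(n_iter):
        grad = Z.T @ (Z @ w - y) / n_samples
        w = group_soft_threshold(w - step * grad, groups, step * lam)
    return w
```

On a synthetic black box that depends only on the first modality's features, the second group's coefficients are driven to (near) zero, which is the disentangling behavior the abstract describes. The query budget is exactly `n_samples` forward passes.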