🤖 AI Summary
Existing interpretability methods struggle to efficiently and uniformly explain local predictions of multimodal models, often constrained by unimodal assumptions, architectural dependencies, or high computational costs. This work proposes a model-agnostic local explanation framework that constructs a modality-aware surrogate model with group-structured sparsity to jointly disentangle modality-level contributions and feature-level attributions. By leveraging only a few forward passes and integrating modality-aware optimization with group-sparse regularization, the method delivers faithful, black-box-compatible explanations at low computational cost. Evaluated on vision–language question answering and clinical prediction tasks, it reduces black-box queries by 35–67× and runtime by 2–8× compared to strong baselines, while maintaining competitive deletion fidelity.
📝 Abstract
Multimodal models are ubiquitous, yet existing explainability methods are often single-modal, architecture-dependent, or too computationally expensive to run at scale. We introduce LEMON (Local Explanations via Modality-aware OptimizatioN), a model-agnostic framework for local explanations of multimodal predictions. LEMON fits a single modality-aware surrogate with group-structured sparsity to produce unified explanations that disentangle modality-level contributions and feature-level attributions. The approach treats the predictor as a black box and is computationally efficient, requiring relatively few forward passes while remaining faithful under repeated perturbations. We evaluate LEMON on vision–language question answering and a clinical prediction task with image, text, and tabular inputs, comparing against representative multimodal baselines. Across backbones, LEMON achieves competitive deletion-based faithfulness while reducing black-box evaluations by 35–67× and runtime by 2–8× compared to strong multimodal baselines.
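The core idea, a perturbation-based linear surrogate with a group-lasso penalty whose groups correspond to modalities, can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual implementation: the function names (`fit_group_sparse_surrogate`, `predict_fn`), the binary masking scheme, and the plain ISTA solver are all assumptions for the sake of the example.

```python
import numpy as np

def group_soft_threshold(w, groups, thresh):
    """Block soft-thresholding: proximal operator of the group-lasso penalty.

    Shrinks each modality's coefficient block toward zero; a whole block is
    zeroed when its norm falls below the (size-scaled) threshold.
    """
    out = w.copy()
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        scale = max(0.0, 1.0 - thresh * np.sqrt(len(idx)) / (norm + 1e-12))
        out[idx] = scale * w[idx]
    return out

def fit_group_sparse_surrogate(predict_fn, d, groups,
                               n_samples=200, lam=0.05, n_iter=500, seed=0):
    """Fit a group-sparse linear surrogate to a black-box predictor.

    predict_fn : maps a binary mask z in {0,1}^d (which features are kept)
                 to the black-box output for the perturbed input.
    groups     : list of index arrays, one per modality (image / text / tabular).
    Returns per-feature weights; the norm of each group is that modality's
    contribution, individual entries are feature-level attributions.
    """
    rng = np.random.default_rng(seed)
    # Sample perturbation masks and query the black box (the only model access).
    Z = rng.integers(0, 2, size=(n_samples, d)).astype(float)
    y = np.array([predict_fn(z) for z in Z])
    # Proximal gradient (ISTA) on 1/(2n)||Zw - y||^2 + lam * group penalty.
    w = np.zeros(d)
    step = 1.0 / (np.linalg.norm(Z, 2) ** 2 / n_samples + 1e-12)
    for _ in range(n_iter):
        grad = Z.T @ (Z @ w - y) / n_samples
        w = group_soft_threshold(w - step * grad, groups, step * lam)
    return w
```

On a synthetic black box that depends only on the first modality's features, the second group's coefficients are driven to (near) zero, which is the disentangling behavior the abstract describes. The query budget is exactly `n_samples` forward passes.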