Diffusion-CAM: Faithful Visual Explanations for dMLLMs

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing Class Activation Mapping (CAM) methods struggle to interpret the smooth, distributed activation patterns that arise from parallel denoising in diffusion-based multimodal large language models (dMLLMs). This work proposes Diffusion-CAM, the first interpretability method designed specifically for dMLLMs, which drops the conventional CAM assumption of local, sequential dependencies to accommodate the non-autoregressive, parallel nature of diffusion generation. Diffusion-CAM differentiably probes intermediate Transformer features and their gradients, then applies modules for spatial deblurring, confounding-factor suppression, and mitigation of redundant token correlations to produce high-fidelity, spatially precise visual explanations. The reported experiments show Diffusion-CAM significantly outperforming state-of-the-art methods in both localization accuracy and visual fidelity, which the authors present as a new standard for interpretability in dMLLMs.

📝 Abstract

While diffusion Multimodal Large Language Models (dMLLMs) have recently made remarkable strides in multimodal generation, the development of interpretability mechanisms has lagged behind their architectural evolution. Unlike traditional autoregressive models that produce sequential activations, diffusion-based architectures generate tokens via parallel denoising, resulting in smooth, distributed activation patterns across the entire sequence. Consequently, existing Class Activation Mapping (CAM) methods, which are tailored to local, sequential dependencies, are ill-suited for interpreting these non-autoregressive behaviors. To bridge this gap, we propose Diffusion-CAM, the first interpretability method specifically tailored to dMLLMs. We derive raw activation maps by differentiably probing intermediate representations in the transformer backbone, thereby capturing both latent features and their class-specific gradients. To address the inherent stochasticity of these raw signals, we incorporate four key modules to resolve spatial ambiguity and mitigate intra-image confounders and redundant token correlations. Extensive experiments demonstrate that Diffusion-CAM significantly outperforms state-of-the-art methods in both localization accuracy and visual fidelity, establishing a new standard for understanding the parallel generation process of diffusion multimodal systems.
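
To make the "raw activation map" step concrete, here is a minimal sketch of the generic Grad-CAM combination of features and gradients, applied to visual tokens. It is not the paper's implementation: the function name `grad_cam_from_tokens`, the array shapes, and the assumption that visual tokens lie on an H x W patch grid are all hypothetical, and the refinement modules described in the abstract are not modeled.

```python
import numpy as np

def grad_cam_from_tokens(features, grads, grid_hw):
    """Generic Grad-CAM over visual tokens (illustrative, not Diffusion-CAM itself).

    features: (N, D) activations of N visual tokens at one probed layer.
    grads:    (N, D) gradients of the target score w.r.t. those activations.
    grid_hw:  (H, W) patch grid with H * W == N (an assumption of this sketch).
    """
    features = np.asarray(features, dtype=np.float64)
    grads = np.asarray(grads, dtype=np.float64)
    h, w = grid_hw
    if features.shape != grads.shape or features.shape[0] != h * w:
        raise ValueError("features/grads must share shape (H*W, D)")

    # Channel importance: average each channel's gradient over all tokens.
    weights = grads.mean(axis=0)                 # shape (D,)
    # Weighted sum of channels per token, keeping positive evidence only.
    cam = np.maximum(features @ weights, 0.0)    # shape (N,)
    # Scale to [0, 1] unless the map is all zeros.
    peak = cam.max()
    if peak > 0:
        cam = cam / peak
    return cam.reshape(h, w)

# Toy usage: 4 tokens on a 2x2 grid, 2 channels.
toy_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
toy_grads = [[1.0, -1.0]] * 4
heatmap = grad_cam_from_tokens(toy_features, toy_grads, (2, 2))
```

In practice the features and gradients would come from hooks on a transformer layer during a denoising step; the sketch takes them as plain arrays so it stays independent of any particular model API.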

Problem

Research questions and friction points this paper is trying to address.

diffusion Multimodal Large Language Models

interpretability

Class Activation Mapping

non-autoregressive generation

visual explanations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-CAM

dMLLMs

visual explanation