Explaining multimodal LLMs via intra-modal token interactions

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MLLM interpretability methods focus on cross-modal attribution while neglecting intra-modal token-level dependencies: isolated patch attribution in vision is constrained by local receptive fields, causing fragmented explanations, while sequential token dependencies in text often induce spurious activations that degrade attribution fidelity. This paper proposes Multi-Scale Explanation Aggregation (MSEA) and Activation Ranking Correlation (ARC), the first framework to systematically model fine-grained intra-modal interactions in both vision and language. MSEA improves visual explanation coherence by aggregating attributions over multi-scale inputs, while ARC scores the relevance of contextual tokens via top-k prediction-ranking alignment and dynamically suppresses irrelevant context. Evaluated across mainstream MLLMs and standard benchmarks, the approach significantly improves attribution coherence, accuracy, and fidelity, yielding more complete and robust cross-modal explanations.

📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. Existing interpretability research has primarily focused on cross-modal attribution, identifying which image regions the model attends to during output generation. However, these approaches often overlook intra-modal dependencies. In the visual modality, attributing importance to isolated image patches ignores spatial context due to limited receptive fields, resulting in fragmented and noisy explanations. In the textual modality, reliance on preceding tokens introduces spurious activations. Failing to effectively mitigate this interference compromises attribution fidelity. To address these limitations, we propose enhancing interpretability by leveraging intra-modal interaction. For the visual branch, we introduce Multi-Scale Explanation Aggregation (MSEA), which aggregates attributions over multi-scale inputs to dynamically adjust receptive fields, producing more holistic and spatially coherent visual explanations. For the textual branch, we propose Activation Ranking Correlation (ARC), which measures the relevance of contextual tokens to the current token via alignment of their top-k prediction rankings. ARC leverages this relevance to suppress spurious activations from irrelevant contexts while preserving semantically coherent ones. Extensive experiments across state-of-the-art MLLMs and benchmark datasets demonstrate that our approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations of model behavior.
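To make the MSEA idea concrete, the following is a minimal, dependency-light sketch of multi-scale attribution aggregation: the input is rescaled to several resolutions, an attribution map is computed at each scale, and all maps are resampled back to the original grid and averaged. The function name `multi_scale_aggregate`, the `attribute_fn` callback, and the nearest-neighbour resizing are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def multi_scale_aggregate(image, attribute_fn, scales=(0.5, 1.0, 2.0)):
    """MSEA-style sketch: average attribution maps computed at several input scales.

    image:        2-D array (H, W); stands in for a preprocessed visual input.
    attribute_fn: hypothetical callback mapping an image to an attribution map
                  of the same shape (e.g. a gradient- or perturbation-based method).
    """
    H, W = image.shape
    acc = np.zeros((H, W))
    for s in scales:
        h, w = max(1, int(H * s)), max(1, int(W * s))
        # nearest-neighbour downscale/upscale via index sampling (keeps the sketch
        # free of image-library dependencies)
        rows = (np.arange(h) * H / h).astype(int)
        cols = (np.arange(w) * W / w).astype(int)
        scaled = image[np.ix_(rows, cols)]
        attr = attribute_fn(scaled)
        # resample the scale-specific attribution back onto the original grid
        back_rows = (np.arange(H) * h / H).astype(int)
        back_cols = (np.arange(W) * w / W).astype(int)
        acc += attr[np.ix_(back_rows, back_cols)]
    return acc / len(scales)
```

Averaging across scales lets coarse resolutions contribute spatial context that single-patch attribution misses, which is the intuition behind the "dynamically adjusted receptive fields" claim.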
Problem

Research questions and friction points this paper is trying to address.

Understanding intra-modal dependencies in multimodal LLM decision-making mechanisms
Addressing fragmented visual explanations by aggregating multi-scale attribution inputs
Mitigating spurious textual activations through contextual token relevance ranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale aggregation for holistic visual explanations
Activation ranking correlation to suppress spurious activations
Intra-modal interaction analysis for faithful model interpretation
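The ARC idea above can be sketched as a rank-agreement score between two tokens' top-k next-token predictions, followed by thresholded suppression of low-relevance contextual attributions. All names (`topk_rank_alignment`, `suppress_spurious`), the displacement-discounted scoring rule, and the threshold value are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def topk_rank_alignment(logits_a, logits_b, k=5):
    """ARC-style sketch: agreement between two tokens' top-k prediction rankings.

    Returns a score in [0, 1]: 1 when both logit vectors rank the same k tokens
    in the same order, 0 when their top-k sets are disjoint.
    """
    top_a = np.argsort(logits_a)[::-1][:k]
    top_b = np.argsort(logits_b)[::-1][:k]
    score = 0.0
    for rank, tok in enumerate(top_a):
        pos = np.where(top_b == tok)[0]
        if pos.size:
            # reward shared top-k tokens, discounted by their rank displacement
            score += 1.0 - abs(rank - pos[0]) / k
    return score / k

def suppress_spurious(attributions, relevances, threshold=0.3):
    """Zero out contextual attributions whose relevance score falls below threshold."""
    attributions = np.asarray(attributions, dtype=float)
    return np.where(np.asarray(relevances) >= threshold, attributions, 0.0)
```

Identical prediction distributions score 1.0 and fully disjoint top-k sets score 0.0, so the threshold cleanly separates semantically coherent context tokens from spurious ones.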
Jiawei Liang
School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University

Ruoyu Chen
Institute of Information Engineering, Chinese Academy of Sciences
Explainable AI · Trustworthy AI · Foundation Model

Xianghao Jiao
School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University

Siyuan Liang
College of Computing and Data Science, Nanyang Technological University
Trustworthy Foundation Model

Shiming Liu
Huawei Technologies Co., Ltd.

Qunli Zhang
Imperial College London

Zheng Hu
Huawei Technologies Co., Ltd.

Xiaochun Cao
Sun Yat-sen University
Computer Vision · Artificial Intelligence · Multimedia · Machine Learning