🤖 AI Summary
In multimodal medical deep learning, the actual contribution of each modality—imaging, text, and physiological signals—to model decisions remains poorly quantified, hindering interpretability and clinical trust. To address this, we propose a model-agnostic, modality-level occlusion-based attribution method that employs sliding-window sensitivity analysis to quantify modality contributions under diverse fusion strategies (early, late, and hybrid). Our work is the first to empirically reveal modality preference and unimodal collapse in multimodal models, and to establish a statistically significant correlation (p < 0.01) between modality-wise attribution scores and the performance of corresponding unimodal baselines. We validate these findings across three clinical domains—radiology, pathology, and time-series physiological monitoring—demonstrating inherent data-level modality imbalance and model-induced bias. The implementation is publicly available.
📝 Abstract
Purpose: High-dimensional multimodal data can now be analyzed with large deep neural networks with little effort, and several fusion methods for combining modalities have been developed. In medicine in particular, where high-dimensional multimodal patient data are common, multimodal models are the next step. How these models actually process the source information, however, remains largely unexplored.

Methods: To this end, we implemented an occlusion-based modality contribution method that is both model-agnostic and performance-agnostic and that quantitatively measures how important each modality in the dataset is for the model to fulfill its task. We applied the method experimentally to three different multimodal medical problems.

Results: We found that some networks have modality preferences that tend toward unimodal collapse, while some datasets are imbalanced from the ground up. Moreover, we identified a link between our metric and the performance of networks trained on a single modality.

Conclusion: The information gained through our metric holds considerable potential to improve the development of multimodal models and the creation of datasets. With our method we make an important contribution to interpretability in deep-learning-based multimodal research and thereby support the integration of multimodal AI into clinical practice. Our code is publicly available at https://github.com/ChristianGappGit/MC_MMD.
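
To make the idea of an occlusion-based, model-agnostic and performance-agnostic modality contribution score more concrete, the following is a minimal sketch and not the implementation from the linked repository. It assumes the model is a black-box callable that maps a dict of named modality arrays to an output vector, that a whole modality is occluded at once by overwriting it with a constant baseline value, and that contributions are normalized to sum to 1. The names `modality_contributions`, `toy_model`, and `baselines` are hypothetical and chosen only for illustration.

```python
import numpy as np


def modality_contributions(model, samples, baselines):
    """Estimate each modality's share of the model's output sensitivity.

    model     -- any callable taking {modality name: array} and returning an
                 array-like output (assumed interface, treated as a black box)
    samples   -- iterable of {modality name: array} inputs
    baselines -- {modality name: scalar} used to overwrite an occluded
                 modality (assumed occlusion scheme)
    No labels are used, so the score does not depend on task performance.
    """
    names = list(baselines)
    totals = {name: 0.0 for name in names}

    for sample in samples:
        reference = np.asarray(model(sample), dtype=float)
        for name in names:
            # Occlude exactly one modality and leave the others untouched.
            occluded = dict(sample)
            occluded[name] = np.full_like(sample[name], baselines[name])
            shifted = np.asarray(model(occluded), dtype=float)
            # Accumulate the absolute change in the model output.
            totals[name] += float(np.abs(reference - shifted).sum())

    grand_total = sum(totals.values())
    if grand_total == 0.0:
        # The output never changed, so shares are undefined; this sketch
        # falls back to equal shares.
        return {name: 1.0 / len(names) for name in names}
    return {name: totals[name] / grand_total for name in names}


def toy_model(inputs):
    # Hypothetical fused model that weights the image three times as
    # strongly as the tabular data.
    score = 3.0 * inputs["image"].mean() + 1.0 * inputs["tabular"].mean()
    return np.array([score])


samples = [{"image": np.ones((4, 4)), "tabular": np.ones(5)}]
scores = modality_contributions(
    toy_model, samples, baselines={"image": 0.0, "tabular": 0.0}
)
print(scores)  # {'image': 0.75, 'tabular': 0.25}
```

In this toy setting, a share close to 1 for a single modality would correspond to what the abstract calls a unimodal collapse. A finer-grained variant could occlude patches, tokens, or individual features instead of the whole modality, at the cost of more model evaluations.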