🤖 AI Summary
In multimodal learning, modalities can be missing at deployment because of cost constraints, sensor failures, or subjective clinical decisions, and this missingness introduces selection bias. Estimating multimodal information gain without accounting for the missingness mechanism can misstate a modality's value, in particular by overestimating the value of redundant modalities. This work formally characterizes how missingness mechanisms distort information gain estimation and proposes ICYM2I, a debiasing framework based on inverse probability weighting (IPW) that corrects estimates of predictive performance and of each modality's contribution. The framework models the missingness process explicitly and is validated on synthetic, semi-synthetic, and real-world clinical datasets, where the correction yields more accurate information gain estimates, mitigates the overestimation of redundant modalities, and supports more reliable assessment of modalities across deployment settings.
📝 Abstract
Multimodal learning is of continued interest in artificial intelligence-based applications, motivated by the potential information gain from combining different types of data. However, modalities collected and curated during development may differ from the modalities available at deployment due to multiple factors including cost, hardware failure, or, as we argue in this work, the perceived informativeness of a given modality. Naïve estimation of the information gain associated with including an additional modality without accounting for missingness may result in improper estimates of that modality's value in downstream tasks. Our work formalizes the problem of missingness in multimodal learning and demonstrates the biases resulting from ignoring this process. To address this issue, we introduce ICYM2I (In Case You Multimodal Missed It), a framework for the evaluation of predictive performance and information gain under missingness through inverse probability weighting-based correction. We demonstrate the importance of the proposed adjustment to estimate information gain under missingness on synthetic, semi-synthetic, and real-world medical datasets.
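The general idea behind an inverse probability weighting correction can be illustrated with a small simulation. The sketch below is not the ICYM2I implementation. It is a minimal, self-contained example under stated assumptions: the helper names `ipw_mean` and `simulate`, the logistic coefficients, and the per-sample "score" are all hypothetical, and the observation probability (propensity) is assumed known, whereas in practice it would have to be estimated from always-observed covariates. It compares a naive average over samples where a modality was collected with a self-normalized (Hájek) IPW average.

```python
import numpy as np


def ipw_mean(values, observed, propensity):
    """Self-normalized (Hajek) IPW estimate of the population mean of `values`.

    Only entries where `observed` is True are used; each is weighted by the
    inverse of its probability of being observed.
    """
    values = np.asarray(values, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    weights = 1.0 / np.asarray(propensity, dtype=float)[observed]
    return float(np.sum(weights * values[observed]) / np.sum(weights))


def simulate(n=200_000, seed=0):
    """Toy setting (all numbers are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)  # covariate that is always observed

    # Hypothetical per-sample score: 1 if a model using the modality is correct.
    p_correct = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
    score = rng.binomial(1, p_correct)

    # The modality is collected more often when x is large, i.e. exactly
    # where the model tends to be correct (selection on informativeness).
    propensity = 1.0 / (1.0 + np.exp(-x))
    observed = rng.random(n) < propensity

    true_mean = float(score.mean())           # target: full-population value
    naive = float(score[observed].mean())     # ignores the missingness process
    corrected = ipw_mean(score, observed, propensity)
    return true_mean, naive, corrected


if __name__ == "__main__":
    true_mean, naive, corrected = simulate()
    print(f"population: {true_mean:.3f}")
    print(f"naive:      {naive:.3f}")
    print(f"IPW:        {corrected:.3f}")
```

In this toy setting the naive average is computed on a subpopulation that is easier than the full population, so it overstates performance, while the weighted average up-weights the rarely observed samples to compensate. The same reweighting could in principle be applied to any per-sample quantity, such as a loss used in an information gain estimate, provided the propensities are bounded away from zero.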