Resilient Vision-Tabular Multimodal Learning under Modality Missingness

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation that medical multimodal models suffer when imaging or structured clinical data is missing. The authors propose a multimodal Transformer framework that handles arbitrary missing-modality patterns during both training and inference, without data imputation or model switching. The approach combines learnable modality tokens, an intermediate fusion strategy, and a masked self-attention mechanism, and uses random modality dropout as a regularizer to improve generalization. The architecture consists of a vision encoder, a tabular encoder, and a multimodal fusion encoder, so the same model covers everything from full-modality to single-modality inputs. Evaluated on multilabel classification of 14 diagnostic findings using the MIMIC-CXR and MIMIC-IV datasets, the method consistently outperforms baseline models across the tested missingness scenarios, with smoother performance degradation and improved robustness.
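To make the random modality dropout idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `modality_dropout`, the drop probability `p_drop`, and the rule that at least one available modality is always kept are all assumptions made for illustration.

```python
import random


def modality_dropout(available, p_drop=0.3, rng=random):
    """Randomly hide available modalities, always keeping at least one.

    available: dict mapping a modality name to True if it is present
               in the sample and False if it is already missing.
    p_drop:    assumed per-modality drop probability (hypothetical value).
    rng:       the `random` module or a `random.Random` instance.
    """
    present = [name for name, ok in available.items() if ok]
    kept = [name for name in present if rng.random() >= p_drop]
    # Assumption: never drop every modality, so the sample stays usable.
    if present and not kept:
        kept = [rng.choice(present)]
    # Modalities that were already missing stay missing.
    return {name: (name in kept) for name in available}


if __name__ == "__main__":
    sample = {"image": True, "tabular": True}
    print(modality_dropout(sample, p_drop=0.5, rng=random.Random(0)))
```

During training, the returned flags would decide which modality inputs are treated as missing for that sample; at inference time the function would not be applied.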

📝 Abstract

Multimodal deep learning has shown strong potential in medical applications by integrating heterogeneous data sources such as medical images and structured clinical variables. However, most existing approaches implicitly assume complete modality availability, an assumption that rarely holds in real-world clinical settings where entire modalities and individual features are frequently missing. In this work, we propose a multimodal transformer framework for joint vision-tabular learning explicitly designed to operate under pervasive modality missingness, without relying on imputation or heuristic model switching. The architecture integrates three components: a vision, a tabular, and a multimodal fusion encoder. Unimodal representations are weighted through learnable modality tokens and fused via intermediate fusion with masked self-attention, which excludes missing tokens and modalities from information aggregation and gradient propagation. To further enhance resilience, we introduce a modality-dropout regularization strategy that stochastically removes available modalities during training, encouraging the model to exploit complementary information under partial data availability. We evaluate our approach on the MIMIC-CXR dataset paired with structured clinical data from MIMIC-IV for multilabel classification of 14 diagnostic findings with incomplete annotations. Two parallel systematic stress-test protocols progressively increase training and inference missingness in each modality separately, spanning fully multimodal to fully unimodal scenarios. Across all missingness regimes, the proposed method consistently outperforms representative baselines, showing smoother performance degradation and improved robustness. Ablation studies further demonstrate that attention-level masking and intermediate fusion with joint fine-tuning are key to resilient multimodal inference.
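The masked self-attention described in the abstract can be illustrated with a small, dependency-free sketch. It is an assumption-laden toy, not the paper's architecture: it uses a single head, identity query/key/value projections, and a hypothetical `present` flag per token. It shows only the forward-pass masking (missing tokens receive zero attention weight); gradient blocking is not modeled here.

```python
import math


def masked_self_attention(tokens, present):
    """Single-head self-attention that ignores missing tokens.

    tokens:  list of equal-length float vectors (placeholders for
             missing tokens must be finite numbers).
    present: list of bools, False marks a missing token.
    Returns one output vector per token; missing tokens get zeros.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for i, query in enumerate(tokens):
        if not present[i]:
            outputs.append([0.0] * dim)  # missing queries produce no output
            continue
        # Scores for missing keys are set to -inf so softmax gives them 0.
        scores = [
            sum(a * b for a, b in zip(query, key)) * scale
            if present[j] else -math.inf
            for j, key in enumerate(tokens)
        ]
        top = max(scores)  # finite, because token i itself is present
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return outputs


if __name__ == "__main__":
    toks = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    print(masked_self_attention(toks, [True, True, False]))
```

Because the masked keys get exactly zero weight, changing the placeholder values of a missing token leaves the outputs of the present tokens unchanged, which is the property that lets one model run on any subset of modalities.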

Problem

Research questions and friction points this paper is trying to address.

multimodal learning

modality missingness

vision-tabular fusion

robustness

medical AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

modality missingness

multimodal transformer

masked self-attention