Context-driven Missing-Modality Learning for Robust Medical Diagnosis with Image-Tabular Data

📅 2026-05-25
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
This work addresses the performance degradation that arbitrarily missing modalities cause in clinical multimodal diagnosis. The authors propose a Context-driven Missing-Modality Learning (CMML) framework that synthesizes absent modalities with a Cascade Residual Transformer-based Autoencoder (CRTA). The approach uses learnable context tokens as a dataset-level semantic prior, enriches the synthesized representations with modality-specific memory banks, and derives instance-adaptive semantic references from the context tokens to align modalities into a unified representation space. In the reported experiments, CMML outperforms state-of-the-art methods, with average AUC improvements of 1.26%, 0.97%, and 1.32% on the Derm7pt, ODIR, and MEN datasets, respectively.
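The synthesis idea summarized above can be illustrated with a toy NumPy sketch. It is not the authors' CRTA: the function names (`attend`, `synthesize_missing`), the single-head attention without learned projections, the mean pooling, and the random stand-ins for learned context tokens and memory-bank entries are all assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Scaled dot-product attention (single head, no learned projections).
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1) @ values

def synthesize_missing(available, context_tokens, memory_bank):
    """Hypothetical sketch: estimate a missing-modality representation.

    available:      (n_avail, d) tokens from the modalities that are present
    context_tokens: (n_ctx, d) stand-in for a learnable dataset-level prior
    memory_bank:    (n_mem, d) stand-in for stored entries of the missing modality
    """
    # Context tokens gather information from the available modalities
    # (residual connection keeps the prior itself in the result).
    ctx = context_tokens + attend(context_tokens, available, available)
    # Pool the updated context into one query vector (an assumption).
    query = ctx.mean(axis=0, keepdims=True)
    # Enrich the estimate with a soft lookup in the memory bank.
    retrieved = attend(query, memory_bank, memory_bank)
    return (query + retrieved)[0]

rng = np.random.default_rng(0)
d = 8
image_tokens = rng.normal(size=(4, d))      # image modality is present
context = rng.normal(size=(3, d))           # would be learned in practice
tabular_memory = rng.normal(size=(16, d))   # tabular modality is missing
z_tab = synthesize_missing(image_tokens, context, tabular_memory)
print(z_tab.shape)
```

In a trained model the context tokens and memory entries would be learned parameters and the attention would use learned projections. The sketch only shows how a missing representation can be composed from available tokens, a shared prior, and a memory lookup.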
📝 Abstract
While multimodal data integrating diverse imaging and clinical tabular records is crucial for accurate medical diagnosis, the arbitrary absence of specific modalities is prevalent in clinical practice and severely degrades the performance of multimodal models. Existing methods either discard missing modalities, leading to information loss, or struggle to synthesize them because they do not capture complex inter-modal dependencies. To address these limitations, we propose a novel Context-driven Missing-Modality Learning (CMML) framework, which sequentially performs modality synthesis and semantic alignment to achieve robust diagnosis under arbitrary missing conditions. Specifically, we design a Cascade Residual Transformer-based Autoencoder (CRTA) that leverages learnable context tokens, acting as a dataset-level semantic prior, to capture inter-modal dependencies and synthesize key missing representations. These representations are further enriched by modality-specific memory banks. To resolve the discrepancy between the originally available and the synthesized representations, we transform the learned context tokens into instance-adaptive semantic references by infusing multimodal representations from the CRTA's outputs. These references guide the alignment of heterogeneous modality representations into a unified space, where class-aware contrastive refinement is finally applied to extract discriminative diagnostic cues. Extensive evaluations on skin lesion (Derm7pt), ocular disease (ODIR), and meningioma (MEN) datasets demonstrate that CMML significantly outperforms state-of-the-art (SOTA) methods, yielding average AUC improvements of 1.26%, 0.97%, and 1.32%, respectively.
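The abstract does not spell out the form of the class-aware contrastive refinement. As a rough illustration, the sketch below assumes it resembles a standard supervised contrastive loss, in which samples sharing a diagnostic label are pulled together in the unified space and all others are pushed apart. The function name `class_aware_contrastive_loss`, the temperature value, and the toy embeddings are hypothetical and not taken from the paper.

```python
import numpy as np

def class_aware_contrastive_loss(z, labels, temperature=0.1):
    """Supervised-contrastive-style loss over a batch of embeddings.

    z:      (n, d) embeddings in the unified space
    labels: (n,) integer class labels
    Assumes at least one class appears twice in the batch.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    n = z.shape[0]
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)            # exclude self-pairs
    # Log-softmax of each anchor's similarities over all other samples.
    sim_max = sim.max(axis=1, keepdims=True)
    shifted = sim - sim_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Positives are other samples with the same label as the anchor.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    n_pos = pos_mask.sum(axis=1)
    valid = n_pos > 0                                  # anchors with a positive
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / n_pos[valid]).mean())

# Toy embeddings: two tight clusters in 2-D.
z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_good = class_aware_contrastive_loss(z, np.array([0, 0, 1, 1]))
loss_bad = class_aware_contrastive_loss(z, np.array([0, 1, 0, 1]))
print(loss_good < loss_bad)
```

When labels agree with the cluster structure the loss is small; when they cut across clusters it is large, which is the signal such a refinement step would use to sharpen class boundaries.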
Problem

Research questions and friction points this paper is trying to address.

missing-modality
multimodal learning
medical diagnosis
image-tabular data
robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

missing-modality learning
context-driven synthesis
multimodal alignment
cascade residual transformer
memory-augmented representation
Tianling Liu
College of Intelligence and Computing, Tianjin University, Tianjin 300350, China.
Lequan Yu
Assistant Professor, The University of Hong Kong
Medical Image Analysis, Multimodal Learning, Computational Pathology, AI for Healthcare
Tong Han
Department of Radiology, Tianjin Huanhu Hospital, Tianjin 300350, China; Tianjin Key Laboratory of Cerebral Vascular and Neurodegenerative Diseases, Tianjin 300350, China.
Liang Wan
College of Intelligence and Computing, Medical College, Tianjin University
computer vision, medical image processing