Context-driven Missing-Modality Learning for Robust Medical Diagnosis with Image-Tabular Data

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

career value

164K/year

🤖 AI Summary

This work addresses the significant performance degradation in clinical multimodal diagnosis caused by arbitrary missing modalities. To tackle this challenge, the authors propose a Context-driven Missing Modality Learning (CMML) framework that synthesizes absent modalities using a cascaded residual Transformer autoencoder. The approach incorporates learnable context tokens as dataset-level semantic priors and leverages both modality-specific memory banks and instance-adaptive semantic references to achieve cross-modal semantic alignment and unified representation learning. Extensive experiments demonstrate that CMML consistently outperforms current state-of-the-art methods, yielding average AUC improvements of 1.26%, 0.97%, and 1.32% on the Derm7pt, ODIR, and MEN datasets, respectively.

📝 Abstract

While multimodal data integrating diverse imaging and clinical tabular records is crucial for accurate medical diagnosis, the arbitrary absence of specific modalities is prevalent in clinical practice, severely degrading the performance of multimodal models. Existing methods either discard missing modalities, leading to information loss, or struggle to synthesize them without capturing complex inter-modal dependencies. To address these limitations, we propose a novel Context-driven Missing-Modality Learning (CMML) framework, which sequentially performs modality synthesis and semantic alignment to achieve robust diagnosis under arbitrary missing conditions. Specifically, we design a Cascade Residual Transformer-based Autoencoder (CRTA) that leverages learnable context tokens acting as dataset-level semantic prior to capture inter-modal dependencies and synthesize key missing representations. These representations are further enriched by modality-specific memory banks. To resolve the discrepancy between original available and synthesized representations, we transform the learned context tokens into instance-adaptive semantic references by infusing multimodal representations from the CRTA's outputs. This reference guides the alignment of heterogeneous modality representations into a unified space, where class-aware contrastive refinement is finally applied to explore discriminative diagnostic cues. Extensive evaluations on skin lesion (Derm7pt), ocular disease (ODIR), and meningioma (MEN) datasets demonstrate that CMML significantly outperforms state-of-the-art (SOTA) methods, yielding AVG AUC improvements of 1.26%, 0.97%, and 1.32%, respectively.

Problem

Research questions and friction points this paper is trying to address.

missing-modality

multimodal learning

medical diagnosis

image-tabular data

robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

missing-modality learning

context-driven synthesis

multimodal alignment