CFCML: A Coarse-to-Fine Crossmodal Learning Framework For Disease Diagnosis Using Multimodal Images and Tabular Data

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a coarse-to-fine cross-modal learning framework to address the modality gap between medical images and tabular data for improved disease diagnosis. In the coarse-grained stage, multi-level image features are fused with tabular data, followed by a fine-grained stage that constructs class-aware unimodal and cross-modal prototypes. A hierarchical anchor relation mining mechanism is introduced to align features across multiple granularities and enhance discriminative representations through prototype-guided contrastive learning. Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches, achieving AUC improvements of 1.53% and 0.91% on the MEN and Derm7pt datasets, respectively.

📝 Abstract
In clinical practice, crossmodal information including medical images and tabular data is essential for disease diagnosis. However, a significant modality gap exists between these data types, which obstructs advances in crossmodal diagnostic accuracy. Most existing crossmodal learning (CML) methods focus primarily on relationships among high-level encoder outputs, neglecting local information in images; they also often overlook the extraction of task-relevant information. In this paper, we propose a novel coarse-to-fine crossmodal learning (CFCML) framework that progressively reduces the modality gap between multimodal images and tabular data by thoroughly exploring inter-modal relationships. At the coarse stage, we explore the relationships between multi-granularity features from various image encoder stages and tabular information, facilitating a preliminary reduction of the modality gap. At the fine stage, we generate unimodal and crossmodal prototypes that incorporate class-aware information, and establish a hierarchical anchor-based relationship mining (HRM) strategy to further diminish the modality gap and extract discriminative crossmodal information. This strategy uses modality samples, unimodal prototypes, and crossmodal prototypes as anchors for contrastive learning, effectively enlarging inter-class disparity while reducing intra-class disparity from multiple perspectives. Experimental results indicate that our method outperforms state-of-the-art (SOTA) methods, achieving improvements of 1.53% and 0.91% in AUC on the MEN and Derm7pt datasets, respectively. The code is available at https://github.com/IsDling/CFCML.
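The prototype-guided contrastive learning the abstract describes can be sketched minimally as follows. This is an illustrative assumption, not the paper's implementation: the function names (`class_prototypes`, `prototype_contrastive_loss`), the temperature parameter, and the use of class-mean prototypes are all choices made here for clarity, and the crossmodal prototypes and hierarchical anchor levels of the full HRM strategy are omitted.

```python
import numpy as np

def class_prototypes(features, labels, num_classes):
    """Class-aware prototypes: the mean feature vector of each class."""
    return np.stack([features[labels == c].mean(axis=0)
                     for c in range(num_classes)])

def prototype_contrastive_loss(features, labels, prototypes, tau=0.1):
    """InfoNCE-style loss with prototypes as anchors: each sample is
    pulled toward its own class prototype (reducing intra-class
    disparity) and pushed away from the others (enlarging inter-class
    disparity)."""
    # Cosine similarity between L2-normalized samples and prototypes.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = f @ p.T / tau                        # shape (N, C)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of each sample's own class prototype.
    return -log_prob[np.arange(len(labels)), labels].mean()
```

In the full framework this loss would be applied from multiple anchor perspectives (samples, unimodal prototypes, and crossmodal prototypes), whereas the sketch shows only the sample-to-prototype view.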
Problem

Research questions and friction points this paper is trying to address.

modality gap
crossmodal learning
disease diagnosis
multimodal data
medical images
Innovation

Methods, ideas, or system contributions that make the work stand out.

coarse-to-fine learning
crossmodal learning
modality gap
prototype-based contrastive learning
hierarchical relationship mining