🤖 AI Summary
In multimodal ophthalmic diagnosis, unequal access to medical resources often leaves only a single modality available (e.g., fundus photography alone or OCT alone), undermining model performance. Existing modality-completion and knowledge-distillation methods suffer from lesion-reconstruction artifacts and a strong reliance on fully paired multimodal data, respectively.
Method: We propose a robust cross-modal alignment framework based on label-guided optimal transport. It introduces a novel class-level, prototype-guided semantic alignment mechanism, integrated with asymmetric cross-modal feature sharing and label-driven soft matching, enabling multi-scale fusion of complementary information when a modality is missing (a conceptual sketch follows this summary).
Contribution/Results: Evaluated on three large-scale multimodal ophthalmic datasets, our method achieves state-of-the-art performance in both complete and missing-modality settings. It significantly improves robustness and generalizability in disease grading—particularly under real-world data scarcity and modality imbalance—without requiring full modality pairing or pixel-level reconstruction.
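Below is a minimal, illustrative sketch of the kind of class-prototype, label-guided soft matching via entropic optimal transport described above. It is not the authors' released code: the `sinkhorn` solver, the cosine cost, the wrong-class penalty, and all names and shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropic OT: return a soft transport plan for an (n x m) cost matrix."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)                         # Gibbs kernel
    a = torch.full((n,), 1.0 / n, device=cost.device)  # uniform source marginal
    b = torch.full((m,), 1.0 / m, device=cost.device)  # uniform target marginal
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iters):                           # Sinkhorn iterations
        u = a / (K @ v + 1e-8)
        v = b / (K.T @ u + 1e-8)
    return u.unsqueeze(1) * K * v.unsqueeze(0)         # transport plan (n x m)

def prototype_alignment_loss(feats, labels, prototypes):
    """Soft-match one modality's features to shared class prototypes.

    feats:      (N, D) features from the available modality
    labels:     (N,)   predicted class ids used to guide the matching
    prototypes: (C, D) class prototypes shared across modalities
    """
    # Cosine distance between every feature and every prototype.
    cost = 1.0 - F.cosine_similarity(feats.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
    # Label guidance: transporting mass to the wrong class is made costly,
    # while the matching itself stays soft rather than a hard assignment.
    penalty = torch.ones_like(cost)
    penalty[torch.arange(feats.size(0)), labels] = 0.0
    plan = sinkhorn(cost + penalty)
    return (plan * cost).sum()                         # class-wise alignment objective
```

In a full model, a loss of this form would presumably be computed for each modality against the same prototype bank, so that both modalities are pulled toward a shared class-level semantic space.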
📝 Abstract
Multimodal ophthalmic imaging-based diagnosis integrates color fundus images with optical coherence tomography (OCT) to provide a comprehensive view of ocular pathologies. However, the uneven global distribution of healthcare resources means that real-world clinical scenarios often encounter incomplete multimodal data, which significantly compromises diagnostic accuracy. Commonly used pipelines, such as modality imputation and distillation methods, face notable limitations: 1) imputation methods struggle to accurately reconstruct key lesion features, since OCT lesions are localized while fundus images vary in style; 2) distillation methods rely heavily on fully paired multimodal training data. To address these challenges, we propose a novel multimodal alignment and fusion framework that robustly handles missing modalities in ophthalmic diagnosis. Taking the distinctive feature characteristics of OCT and fundus images into account, we emphasize the alignment of semantic features within the same category and explicitly learn a soft matching between modalities, allowing the missing modality to leverage information from the available one and achieving robust cross-modal feature alignment even when a modality is absent. Specifically, we leverage optimal transport for multi-scale modality feature alignment: class-wise alignment through predicted class prototypes and feature-wise alignment via cross-modal shared feature transport. Furthermore, we propose an asymmetric fusion strategy that effectively exploits the distinct characteristics of the OCT and fundus modalities. Extensive evaluations on three large multimodal ophthalmic datasets demonstrate our model's superior performance under various modality-incomplete scenarios, achieving state-of-the-art results in both complete-modality and modality-incomplete settings. Code is available at https://github.com/Qinkaiyu/RIMA
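As a rough sketch of the feature-wise and asymmetric-fusion side of the idea (again an assumption-laden illustration rather than the released implementation), the transport plan between the available modality's features and the shared prototypes can act as a barycentric map that synthesizes surrogate features for the missing modality, which are then fused asymmetrically with the available one. The cross-attention choice and all module names below are hypothetical; `sinkhorn` refers to the helper in the earlier sketch.

```python
import torch
import torch.nn as nn

def transport_surrogate(plan, prototypes):
    """Barycentric projection: mix class prototypes according to each sample's
    transport row to stand in for the missing modality's features.
    plan: (N, C) transport plan; prototypes: (C, D)."""
    row_mass = plan.sum(dim=1, keepdim=True) + 1e-8
    return (plan @ prototypes) / row_mass              # (N, D) surrogate features

class AsymmetricFusion(nn.Module):
    """Hypothetical asymmetric fusion head: OCT tokens (localized lesions) query
    the fundus representation, while the fundus path is kept as a residual."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, oct_tokens, fundus_tokens):
        # oct_tokens, fundus_tokens: (B, T, D); either sequence may be a surrogate
        # produced by transport_surrogate when that modality is missing.
        attended, _ = self.cross_attn(oct_tokens, fundus_tokens, fundus_tokens)
        return self.proj(torch.cat([attended, fundus_tokens], dim=-1))
```

Whether cross-attention is the right asymmetric operator here is a design guess; the point is only that the OT plan provides a principled soft correspondence through which the two modalities can be fused without pixel-level reconstruction.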