OphMAE: Bridging Volumetric and Planar Imaging with a Foundation Model for Adaptive Ophthalmological Diagnosis

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing ophthalmic AI systems, which are often confined to single-modality analysis and struggle to effectively integrate complementary 3D OCT and 2D en face OCT images while facing deployment challenges in resource-constrained settings. The authors propose OphMAE, a multimodal foundation model for ophthalmic diagnosis built upon a masked autoencoder framework, featuring cross-modal fusion and adaptive inference mechanisms that enable joint 3D/2D pretraining and efficient unimodal inference. Evaluated across 17 diagnostic tasks, OphMAE achieves state-of-the-art performance, with AUCs of 96.9% for AMD and 97.2% for DME. Notably, it maintains strong performance using only 2D inputs (AMD AUC: 93.7%) and retains an AUC of 95.7% with as few as 500 labeled samples, substantially alleviating modality dependency and data efficiency bottlenecks.
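The masked autoencoder pretraining mentioned above can be illustrated with a minimal sketch of random patch masking applied to a paired 3D/2D sample. The summary does not specify any of the details here: the token counts, the 0.75 mask ratio (the default from the original MAE work on natural images), the independent masking of each modality, and the function names `random_mask` and `mask_paired_sample` are all assumptions for illustration, not OphMAE's actual configuration.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Split token indices into (visible, masked) sets, MAE style.

    The encoder would see only the visible tokens; the decoder would be
    trained to reconstruct the masked ones.
    """
    num_masked = int(num_tokens * mask_ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def mask_paired_sample(n_tokens_3d, n_tokens_2d, mask_ratio=0.75, seed=0):
    """Mask the 3D-volume tokens and the 2D en face tokens of one pair.

    Assumption: each modality is masked independently at the same ratio.
    """
    rng = random.Random(seed)
    return {
        "3d": random_mask(n_tokens_3d, mask_ratio, rng),
        "2d": random_mask(n_tokens_2d, mask_ratio, rng),
    }


# Hypothetical token counts for one paired sample.
sample = mask_paired_sample(n_tokens_3d=512, n_tokens_2d=196)
visible_3d, masked_3d = sample["3d"]
visible_2d, masked_2d = sample["2d"]
```

With these hypothetical sizes, the encoder would process 128 of the 512 volumetric tokens and 49 of the 196 en face tokens, which is where the compute savings of masked pretraining come from.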

📝 Abstract

The advent of foundation models has heralded a new era in medical artificial intelligence (AI), enabling the extraction of generalizable representations from large-scale unlabeled datasets. However, current ophthalmic AI paradigms are predominantly constrained to single-modality inference, which is at odds with clinical practice, where diagnosis relies on the synthesis of complementary imaging modalities. Furthermore, the deployment of high-performance AI in resource-limited settings is frequently impeded by the unavailability of advanced three-dimensional imaging hardware. Here, we present the Ophthalmic multimodal Masked Autoencoder (OphMAE), a multi-imaging foundation model engineered to combine the volumetric depth of 3D Optical Coherence Tomography (OCT) with the planar context of 2D en face OCT. OphMAE implements a novel cross-modal fusion architecture and an adaptive inference mechanism, and was pre-trained on a large dataset of 183,875 paired OCT images derived from 32,765 patients. In a rigorous benchmark encompassing 17 diverse diagnostic tasks with 48,340 paired OCT images from 8,191 patients, the model demonstrated state-of-the-art performance, achieving an Area Under the Curve (AUC) of 96.9% for Age-related Macular Degeneration (AMD) and 97.2% for Diabetic Macular Edema (DME), consistently surpassing existing single-modal and multimodal foundation models. Crucially, OphMAE exhibits robust engineering adaptability: it maintains high diagnostic accuracy, such as 93.7% AUC for AMD, even when restricted to single-modality 2D inputs, and demonstrates exceptional data efficiency by retaining 95.7% AUC with as few as 500 labeled samples. This work establishes a scalable and adaptable framework for ophthalmic AI, ensuring robust performance across different tasks.
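The idea of adaptive inference, where the model still produces a prediction when only the 2D modality is available, can be sketched as follows. The abstract does not describe how the fusion works, so this sketch substitutes a simple average over whichever modality embeddings are present; the name `fuse_available` and the averaging rule are assumptions for illustration, not the cross-modal fusion architecture used in OphMAE.

```python
from typing import List, Optional, Sequence


def fuse_available(
    emb_3d: Optional[Sequence[float]],
    emb_2d: Optional[Sequence[float]],
) -> List[float]:
    """Fuse whichever modality embeddings are present.

    Stand-in rule (an assumption): element-wise mean over the available
    embeddings. With a single modality the embedding passes through unchanged.
    """
    available = [e for e in (emb_3d, emb_2d) if e is not None]
    if not available:
        raise ValueError("at least one modality is required")
    dim = len(available[0])
    if any(len(e) != dim for e in available):
        raise ValueError("embeddings must share one dimension")
    return [sum(e[i] for e in available) / len(available) for i in range(dim)]


# Both modalities available (e.g. a site with a 3D OCT scanner).
fused_both = fuse_available([1.0, 3.0], [3.0, 5.0])

# Only the 2D en face image available (e.g. a resource-limited site).
fused_2d_only = fuse_available(None, [3.0, 5.0])
```

The design point is that the downstream classifier receives a vector of the same shape either way, so one model can serve sites with and without 3D imaging hardware.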

Problem

Research questions and friction points this paper is trying to address.

multimodal imaging

ophthalmic AI

3D OCT

2D en face OCT

resource-limited settings

Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation model

multimodal fusion

adaptive inference