AI Summary
To address weak generalization, reliance on pixel-level annotations, and the difficulty of integrating domain expertise in medical image analysis, particularly for retinal photography, this paper introduces the first retina-specific multimodal foundation model. Methodologically, it pioneers encoding ophthalmological expert knowledge into clinical textual reports as supervision signals, enabling image-semantic alignment without pixel-level annotations. It further proposes an anatomy-aware contrastive learning framework that integrates a CLIP-based architecture with retinal anatomical priors while enhancing the text representations of clinical reports. Evaluated on five retinal disease classification and localization tasks, the model achieves a 9.2% average accuracy improvement over prior methods and shows markedly better zero-shot transfer than general-purpose vision-language models (VLMs). This work establishes a new paradigm for building foundation models tailored to specialized medical domains.
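To make the CLIP-based training objective concrete, the sketch below shows a standard symmetric image-text contrastive (InfoNCE) loss of the kind such frameworks build on. This is a generic illustration only, not the paper's actual method: the anatomy-aware priors and the enhanced report-text representations described in the summary are not modeled here, and the function and parameter names are my own.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products become cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    diag = np.arange(len(logits))        # matched pairs lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()     # diagonal = correct pairings

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Training pulls each retinal image toward its own report embedding (the diagonal) and pushes it away from every other report in the batch, which is what enables zero-shot transfer: at test time, class names or report snippets are embedded as text and the nearest text embedding classifies the image.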