RetFiner: A Vision-Language Refinement Scheme for Retinal Foundation Models

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing OCT foundation models rely solely on image-based self-supervised learning, resulting in limited semantic understanding—particularly of anatomical structures and pathological concepts—and consequently suboptimal performance on complex downstream tasks; moreover, they require supervised fine-tuning for domain adaptation to specific patient populations.
Method: We propose the first vision-language joint optimization framework tailored for retinal OCT, which integrates textual supervision into representation learning without additional annotations, via multi-objective collaborative training—including contrastive learning, image–text matching, and masked language modeling.
Contribution/Results: The framework significantly enhances anatomical and pathological semantic comprehension, enabling zero-shot or lightweight direct adaptation. Evaluated on seven OCT classification tasks using RETFound, UrFound, and VisionFM, it achieves average linear probe accuracy gains of +5.8, +3.9, and +2.1 percentage points, respectively, consistently outperforming all baselines.
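To make the three training objectives concrete, here is a minimal PyTorch sketch of how contrastive, image–text matching, and masked language modeling losses are typically combined in one objective. All function names, tensor shapes, and the equal loss weighting are illustrative assumptions, not taken from the RetFiner codebase.

```python
# Hypothetical sketch of a multi-objective vision-language loss
# (contrastive + image-text matching + masked language modeling).
# Names, shapes, and weights are illustrative, not the paper's code.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs along the batch diagonal."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def itm_loss(pair_logits, is_match):
    """Binary image-text matching head on fused image-text features."""
    return F.binary_cross_entropy_with_logits(pair_logits, is_match)

def mlm_loss(token_logits, token_labels):
    """Masked language modeling; unmasked positions use label -100 by convention."""
    return F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                           token_labels.view(-1), ignore_index=-100)

# Toy batch: B pairs, D-dim embeddings, T text tokens, V-word vocabulary.
B, D, T, V = 4, 32, 8, 100
img, txt = torch.randn(B, D), torch.randn(B, D)
total = (contrastive_loss(img, txt)
         + itm_loss(torch.randn(B), torch.randint(0, 2, (B,)).float())
         + mlm_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T))))
```

In practice the three terms are often weighted and scheduled differently; the sketch simply sums them to show how textual supervision enters through all three heads at once.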

📝 Abstract
The rise of imaging techniques such as optical coherence tomography (OCT) and advances in deep learning (DL) have enabled clinicians and researchers to streamline retinal disease staging. A popular DL approach is self-supervised learning (SSL), where models learn from vast amounts of unlabeled data, avoiding costly annotation. SSL has allowed the development of foundation models (FMs), large models that can be used for a variety of downstream tasks. However, existing FMs for OCT, trained solely on image data, lack a comprehensive and robust semantic understanding of images, as evidenced by their downstream performance (especially for complex tasks), and thus require supervised fine-tuning (which may be unfeasible) to better adapt to specific applications and populations. To address this, we propose RetFiner, an SSL vision-language refinement scheme that improves the representations of existing FMs and enables their efficient and direct adaptation to specific populations for improved downstream performance. Our method uses a diverse set of training objectives which take advantage of the rich supervisory signal found in textual data. We tested RetFiner on the retinal FMs RETFound, UrFound, and VisionFM, showing significant improvements in linear probing performance on seven highly diverse OCT classification tasks, with an average increase of 5.8, 3.9, and 2.1 percentage points over their baselines, respectively. Our code and model weights are publicly available at https://github.com/ronnief1/RetFiner.
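The linear-probing protocol used in the evaluation above can be sketched in a few lines: the foundation model's encoder is frozen and only a single linear classifier is trained on its features. The toy encoder, dimensions, and hyperparameters below are stand-ins, not the actual RETFound/UrFound/VisionFM setup.

```python
# Hypothetical illustration of linear probing: a frozen backbone's
# features feed one trainable linear layer. Toy sizes throughout.
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU())  # stand-in for a frozen FM backbone
for p in encoder.parameters():
    p.requires_grad = False  # linear probing never updates the backbone

probe = nn.Linear(32, 3)  # 3-class toy classification task
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
x, y = torch.randn(20, 64), torch.randint(0, 3, (20,))

losses = []
for _ in range(50):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(probe(encoder(x)), y)
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Because only the probe's weights move, probing accuracy directly measures the quality of the frozen representations, which is why the paper reports it as its main comparison metric.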
Problem

Research questions and friction points this paper is trying to address.

Enhancing retinal foundation models' semantic understanding with vision-language refinement
Improving downstream performance without costly supervised fine-tuning
Enabling efficient adaptation to diverse OCT classification tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language refinement for retinal foundation models
Self-supervised learning with diverse training objectives
Leveraging textual data for improved semantic understanding