ViLReF: An Expert Knowledge Enabled Vision-Language Retinal Foundation Model

📅 2024-08-20
📈 Citations: 5
Influential: 0
🤖 AI Summary
Addressing the challenges of fine-grained semantic alignment between fundus images and clinical reports, as well as interference from false-negative samples during pretraining, this work introduces ViLReF, an ophthalmology-specific vision-language foundation model. Pretrained on 451,956 paired fundus image–text samples, it combines three key components: (1) a Weighted Similarity Coupling Loss that dynamically regulates how quickly sample pairs are pushed apart in the feature space; (2) expert-knowledge-guided label extraction to improve textual semantic fidelity; and (3) a batch expansion module with momentum-encoder-maintained dynamic memory queues that supplies extra samples to fill the vacancies left by false-negative filtering. Extensive experiments on multiple ophthalmic downstream tasks, including disease classification and lesion segmentation, demonstrate strong zero-shot and transfer performance, outperforming existing vision-language models. The code is publicly available.

📝 Abstract
Subtle semantic differences in retinal image and text data present great challenges for pre-training visual-language models. Moreover, false negative samples, i.e., image-text pairs having the same semantics but incorrectly regarded as negatives, disrupt the visual-language pre-training process and affect the model's learning ability. This work aims to develop a retinal foundation model, called ViLReF, by pre-training on a paired dataset comprising 451,956 retinal images and corresponding diagnostic text reports. In our vision-language pre-training strategy, we leverage expert knowledge to facilitate the extraction of labels and propose a novel constraint, the Weighted Similarity Coupling Loss, to adjust the speed of pushing sample pairs further apart dynamically within the feature space. Furthermore, we employ a batch expansion module with dynamic memory queues, maintained by momentum encoders, to supply extra samples and compensate for the vacancies caused by eliminating false negatives. Extensive experiments are conducted on multiple datasets for downstream classification and segmentation tasks. The experimental results demonstrate the powerful zero-shot and transfer learning capabilities of ViLReF, verifying the effectiveness of our pre-training strategy. Our ViLReF model is available at: https://github.com/T6Yang/ViLReF.
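The abstract does not give the exact form of the Weighted Similarity Coupling Loss, but the overall setup it describes — contrastive image-text pretraining in which pairs sharing the same extracted label are excluded as false negatives, with a memory queue supplying extra negatives — can be sketched as follows. This is a hypothetical illustration, not the paper's actual loss; the function name, the simple label-equality mask, and the InfoNCE-style formulation are all assumptions.

```python
import numpy as np

def masked_contrastive_loss(img_emb, txt_emb, labels, queue_emb, queue_labels, tau=0.07):
    """Hypothetical InfoNCE-style image-to-text loss (NOT the paper's exact loss).

    Image-text pairs whose expert-extracted labels match are masked out as
    false negatives; embeddings from a memory queue extend the negative set.
    """
    # L2-normalize embeddings so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_all = np.concatenate([txt_emb, queue_emb], axis=0)
    txt_all = txt_all / np.linalg.norm(txt_all, axis=1, keepdims=True)
    all_labels = np.concatenate([labels, queue_labels])

    logits = img @ txt_all.T / tau              # shape (B, B + queue_size)
    B, N = logits.shape

    # Diagonal entries are the true positive pairs.
    pos = np.zeros((B, N), dtype=bool)
    pos[np.arange(B), np.arange(B)] = True

    # Same-label, off-diagonal entries are false negatives: exclude them.
    false_neg = (labels[:, None] == all_labels[None, :]) & ~pos
    logits = np.where(false_neg, -np.inf, logits)

    # Stable log-softmax; exp(-inf) = 0, so masked entries vanish.
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return float(-log_prob[pos].mean())
```

In a full implementation the masked slots would be refilled from the queue (the "batch expansion" the abstract mentions), and a symmetric text-to-image term would be added; both are omitted here for brevity.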
Problem

Research questions and friction points this paper is trying to address.

Address subtle semantic differences in retinal image-text pairs
Mitigate false negative samples disrupting visual-language pre-training
Develop retinal foundation model using expert knowledge constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages expert knowledge for label extraction
Introduces Weighted Similarity Coupling Loss constraint
Uses batch expansion with dynamic memory queues
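The last innovation — dynamic memory queues maintained by momentum encoders — follows a pattern familiar from MoCo-style contrastive learning: a slowly updated "key" encoder produces features that are pushed into a fixed-capacity FIFO queue of negatives. The sketch below shows that pattern in a minimal form; the class name, the linear-projection stand-in for the encoder, and the EMA coefficient are illustrative assumptions, not details from the paper.

```python
import numpy as np
from collections import deque

class MomentumQueue:
    """Minimal sketch of a momentum-encoder-maintained memory queue.

    A MoCo-style pattern: the key encoder is an exponential moving average
    (EMA) of the query encoder, and its outputs fill a FIFO negative queue.
    The linear 'encoder' here is a placeholder for a real network.
    """

    def __init__(self, dim, capacity, m=0.999, seed=0):
        self.m = m
        self.queue = deque(maxlen=capacity)          # old features drop out automatically
        rng = np.random.default_rng(seed)
        self.key_weights = rng.normal(scale=0.01, size=(dim, dim))

    def update_key_encoder(self, query_weights):
        # EMA update: key <- m * key + (1 - m) * query (no gradient flows here).
        self.key_weights = self.m * self.key_weights + (1 - self.m) * query_weights

    def encode_and_enqueue(self, batch):
        # Encode a batch with the key encoder and append features to the queue.
        feats = batch @ self.key_weights
        for f in feats:
            self.queue.append(f)
        return feats

    def negatives(self):
        # Current pool of extra negatives for the contrastive loss.
        return np.stack(list(self.queue))
```

Under this reading, the queue is what "compensates for the vacancies" left when false negatives are filtered out: masked in-batch negatives are replaced by stored features, keeping the effective negative count stable.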