🤖 AI Summary
Problem: Severe scarcity of annotated retinal imaging data hinders ophthalmic AI development; in particular, it limits fine-grained color fundus photography (CFP) generation with controllable anatomy and pathology.
Method: We propose an LLM-driven structured text generation paradigm to construct the first synthetic dataset of 1.4 million image–text pairs, enabling precise semantic control over anatomical details, disease staging, and lesion types; further, we design a three-stage diffusion model training framework to achieve fine-grained text–image alignment and medically controllable synthesis.
Contribution/Results: Evaluation shows that 62.07% of synthesized images are deemed “indistinguishable from real clinical images” by ophthalmologists. The generated data improves diagnostic accuracy by 10–25% in diabetic retinopathy grading and glaucoma detection tasks, significantly advancing high-fidelity synthetic data generation and clinically interpretable AI deployment.
📝 Abstract
The scarcity of high-quality, labelled retinal imaging data presents a significant challenge for developing machine learning models in ophthalmology. Existing methods for synthesising Colour Fundus Photographs (CFPs) rely primarily on predefined disease labels and therefore fail to generate images for broader categories with diverse and fine-grained anatomical structures. To overcome these challenges, we first introduce an innovative pipeline that creates a large-scale synthetic Caption-CFP dataset of 1.4 million entries, called RetinaLogos-1400k. Specifically, RetinaLogos-1400k uses large language models (LLMs) to describe retinal conditions and key structures, such as optic disc configuration, vascular distribution, nerve fibre layers, and pathological features. Building on this dataset, we employ a novel three-step training framework, called RetinaLogos, which enables fine-grained semantic control over retinal images and accurately captures different stages of disease progression, subtle anatomical variations, and specific lesion types. Extensive experiments demonstrate state-of-the-art performance across multiple datasets, with ophthalmologists judging 62.07% of text-driven synthetic images to be indistinguishable from real ones. Moreover, the synthetic data improves accuracy by 10%-25% in diabetic retinopathy grading and glaucoma detection, thereby providing a scalable solution to augment ophthalmic datasets.
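To make the idea of a structured caption concrete, the following is a minimal illustrative sketch, not the authors' pipeline. It assumes a hypothetical record type (`RetinalFindings`) whose fields mirror the structures the abstract lists, and it uses a plain template in place of the LLM step. All names and field choices here are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RetinalFindings:
    """Hypothetical structured record; field names are illustrative only."""
    condition: str
    optic_disc: str
    vessels: str
    nerve_fibre_layer: str
    lesions: List[str] = field(default_factory=list)


def build_caption(f: RetinalFindings) -> str:
    """Compose a caption from structured fields.

    A fixed template stands in for the LLM described in the abstract.
    """
    parts = [
        f"Colour fundus photograph showing {f.condition}.",
        f"Optic disc: {f.optic_disc}.",
        f"Vessels: {f.vessels}.",
        f"Nerve fibre layer: {f.nerve_fibre_layer}.",
    ]
    # Only mention lesions when the record actually lists some.
    if f.lesions:
        parts.append("Lesions: " + ", ".join(f.lesions) + ".")
    return " ".join(parts)


example = RetinalFindings(
    condition="moderate non-proliferative diabetic retinopathy",
    optic_disc="normal cup-to-disc ratio",
    vessels="mild venous beading",
    nerve_fibre_layer="intact",
    lesions=["microaneurysms", "dot haemorrhages"],
)
caption = build_caption(example)
```

In a text-to-image setting, a caption of this kind would serve as the conditioning prompt. Keeping the fields separate until the last step is what would let anatomy, disease stage, and lesion type be varied independently.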