🤖 AI Summary
Fine-grained pathological reasoning on optical coherence tomography angiography (OCTA) images is hindered by the scarcity of large-scale, high-fidelity image–text paired datasets annotated with precise pathological descriptions.
Method: We propose Synthetic Vasculature Reasoning (SVR), a controllable synthesis framework that jointly models vascular structure and pathological features to generate realistic OCTA images exhibiting diabetic retinopathy manifestations (e.g., capillary non-perfusion, microaneurysms) alongside corresponding fine-grained explanatory texts. SVR integrates generative modeling for vessel-texture synthesis and pathology overlay, multi-stage prompt-driven text generation, and fine-tuning of the Qwen3-VL-8B multimodal foundation model.
Contribution/Results: We release OCTA-100K-SVR—the first large-scale OCTA image–reasoning paired dataset. Evaluated on real OCTA images, our method achieves 89.67% zero-shot balanced accuracy, surpassing supervised baselines. Clinical assessment confirms significant improvements in diagnostic interpretability and lesion localization precision.
📝 Abstract
Vision-Language Models (VLMs) offer a promising path toward interpretable medical diagnosis, allowing users to request clinical explanations alongside predictions and across different imaging modalities. However, training VLMs for detailed reasoning requires large-scale image-text datasets, and in many specialized domains, such as reading Optical Coherence Tomography Angiography (OCTA) images, precise text with grounded descriptions of pathologies is scarce or even non-existent. To overcome this bottleneck, we introduce Synthetic Vasculature Reasoning (SVR), a framework that controllably synthesizes paired images and text: realistic retinal vasculature exhibiting Diabetic Retinopathy (DR) features (capillary dropout, microaneurysms, neovascularization, and tortuosity), together with automatically generated fine-grained reasoning texts. With this framework we curate OCTA-100K-SVR, an OCTA image-reasoning dataset of 100,000 pairs. Our experiments show that a general-purpose VLM (Qwen3-VL-8B) trained on the dataset achieves a zero-shot balanced classification accuracy of 89.67% on real OCTA images, outperforming supervised baselines. Human expert evaluation further demonstrates that training on OCTA-100K-SVR significantly enhances explanation quality and pathology localization on clinical data.
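The core idea of controllable synthesis, i.e. generating an image whose lesion geometry is known by construction so that a precise, grounded description can be emitted alongside it, can be sketched in a few lines. The following is a minimal illustrative toy, not the authors' pipeline: `draw_vessels` and `inject_dropout` are hypothetical helpers that stand in for the paper's vessel-texture synthesis and pathology overlay.

```python
# Illustrative sketch (NOT the SVR implementation): synthesize a toy vessel
# mask, inject a capillary-dropout lesion at a known location, and emit a
# paired fine-grained description grounded in the known lesion geometry.
import numpy as np

def draw_vessels(size=128, n_walks=40, steps=200, seed=0):
    """Random-walk strokes as a crude stand-in for a vascular network."""
    rng = np.random.default_rng(seed)
    img = np.zeros((size, size), dtype=np.uint8)
    for _ in range(n_walks):
        y, x = rng.integers(0, size, 2)
        angle = rng.uniform(0, 2 * np.pi)
        for _ in range(steps):
            angle += rng.normal(0, 0.3)            # smooth, tortuous paths
            y = int(np.clip(y + np.sin(angle), 0, size - 1))
            x = int(np.clip(x + np.cos(angle), 0, size - 1))
            img[y, x] = 255
    return img

def inject_dropout(img, center, radius):
    """Erase vessel signal inside a disc to mimic capillary non-perfusion."""
    yy, xx = np.ogrid[:img.shape[0], :img.shape[1]]
    mask = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2
    out = img.copy()
    out[mask] = 0
    return out, mask

healthy = draw_vessels()
pathological, lesion_mask = inject_dropout(healthy, center=(64, 64), radius=15)

# Because the lesion was placed programmatically, the paired text is exact:
text = (f"A capillary non-perfusion area covering ~{int(lesion_mask.sum())} px "
        f"is centered at (64, 64); vessel signal is absent within a 15-px radius.")
```

The design point this illustrates is that synthetic generation inverts the usual annotation problem: instead of labeling lesions on real images, the lesion parameters are sampled first, so location, extent, and type are available for free when composing the reasoning text.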