Synthetic Vasculature and Pathology Enhance Vision-Language Model Reasoning

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Fine-grained pathological reasoning on optical coherence tomography angiography (OCTA) images is hindered by the scarcity of large-scale, high-fidelity image-text paired datasets annotated with precise pathological descriptions. Method: We propose SVR, a controllable synthesis framework that jointly models vascular structure and pathological features to generate realistic OCTA images exhibiting diabetic retinopathy manifestations (e.g., capillary non-perfusion, microaneurysms) alongside corresponding fine-grained explanatory texts. SVR integrates generative modeling for vessel-texture synthesis and pathology overlay, multi-stage prompt-driven text generation, and fine-tuning of the Qwen3-VL-8B multimodal foundation model. Contribution/Results: We curate OCTA-100K-SVR, an OCTA image-reasoning dataset of 100,000 pairs. Evaluated zero-shot on real OCTA images, the fine-tuned model achieves 89.67% balanced accuracy, surpassing supervised baselines. Human expert evaluation shows significant improvements in explanation quality and pathology localization on clinical data.

📝 Abstract
Vision-Language Models (VLMs) offer a promising path toward interpretable medical diagnosis by allowing users to ask for clinical explanations alongside predictions and across different modalities. However, training VLMs for detailed reasoning requires large-scale image-text datasets. In many specialized domains, such as reading Optical Coherence Tomography Angiography (OCTA) images, precise text with grounded descriptions of pathologies is scarce or even non-existent. To overcome this bottleneck, we introduce Synthetic Vasculature Reasoning (SVR), a framework that controllably synthesizes images and corresponding text. Specifically, it generates realistic retinal vasculature with Diabetic Retinopathy (DR) features (capillary dropout, microaneurysms, neovascularization, and tortuosity) while automatically generating granular reasoning texts. Based on this, we curate OCTA-100K-SVR, an OCTA image-reasoning dataset with 100,000 pairs. Our experiments show that a general-purpose VLM (Qwen3-VL-8B) trained on the dataset achieves a zero-shot balanced classification accuracy of 89.67% on real OCTA images, outperforming supervised baselines. Through human expert evaluation, we also demonstrate that it significantly enhances explanation quality and pathology localization on clinical data.
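The headline metric in the abstract is balanced accuracy, which is the mean of per-class recall and is therefore less flattering than plain accuracy when classes are imbalanced. The sketch below illustrates the metric only; the labels and predictions are made-up toy values, not data or results from the paper, and the function name is an assumption for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall over classes present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # all samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, hypothetical labels: 8 healthy scans and 2 DR scans.
y_true = ["healthy"] * 8 + ["DR"] * 2
# The classifier misses one of the two DR cases.
y_pred = ["healthy"] * 8 + ["healthy", "DR"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(f"plain accuracy: {plain:.2f}, balanced accuracy: {balanced:.2f}")
```

In this toy case the missed minority-class scan lowers balanced accuracy far more than plain accuracy, which is why the metric suits imbalanced diagnostic data.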
Problem

Research questions and friction points this paper is trying to address.

Text with grounded descriptions of pathologies is scarce or non-existent for OCTA images
Training VLMs for detailed reasoning requires large-scale image-text datasets
General-purpose VLMs need better diagnostic accuracy and explanation quality on clinical OCTA data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes realistic retinal vasculature with DR features
Automatically generates granular reasoning texts for images
Creates large-scale OCTA image-reasoning dataset for training
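One way to see why controllable synthesis makes grounded text cheap: if the same parameters that drive image synthesis also drive text generation, the description matches the image by construction. The toy sketch below illustrates that idea only. It is not the SVR pipeline; `PathologySpec`, `describe`, and the sentence templates are hypothetical names and wording invented for this example, and the image synthesis step is omitted entirely.

```python
from dataclasses import dataclass


@dataclass
class PathologySpec:
    """Hypothetical control parameters for one synthetic sample."""
    capillary_dropout: bool = False
    microaneurysms: int = 0
    neovascularization: bool = False
    tortuosity: bool = False


def describe(spec: PathologySpec) -> str:
    """Build a description from the same spec an image generator would use."""
    findings = []
    if spec.capillary_dropout:
        findings.append("areas of capillary dropout")
    if spec.microaneurysms > 0:
        findings.append(f"{spec.microaneurysms} microaneurysm(s)")
    if spec.neovascularization:
        findings.append("neovascularization")
    if spec.tortuosity:
        findings.append("increased vessel tortuosity")
    if not findings:
        return "No diabetic retinopathy features were specified."
    return "The image shows " + ", ".join(findings) + "."


# Hypothetical sample: the spec would be passed to both the image
# generator (not shown) and the text generator.
spec = PathologySpec(capillary_dropout=True, microaneurysms=3)
print(describe(spec))
```

Because the text is derived from the generation parameters rather than from the rendered image, no manual annotation is needed to keep the pair consistent.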
Chenjun Li
School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, USA
Cheng Wan
Georgia Institute of Technology
Laurin Lux
Weill Cornell Medicine, New York, NY 10021, USA
Alexander Berger
Weill Cornell Medicine, New York, NY 10021, USA
Richard B. Rosen
New York Eye and Ear Infirmary of Mount Sinai, New York, NY
Martin J. Menten
Technical University of Munich
Machine Learning for Healthcare, Medical Imaging, Computer Vision
Johannes C. Paetzold
Cornell University, Weill Cornell Medicine
Machine Learning, Geometric Deep Learning, Generative Models, Biomedical Image Analysis