🤖 AI Summary
Existing large vision-language models for retinal medical report generation suffer from data scarcity, overfitting, and difficulty recognizing subtle pathological features. To overcome these challenges, the authors propose DREAM, a framework that integrates retinal images with ophthalmologist-curated clinical keywords through a two-stage fusion mechanism. First, an Abstractor module maps image and keyword features into a shared semantic space to enhance the visual representations; then, an Adaptor module dynamically weights and fuses these features using learnable parameters. A contrastive alignment strategy further aligns the fused representations with ground-truth reports during training, keeping generated reports clinically faithful. By combining expert knowledge with adaptive multimodal fusion, DREAM generates high-fidelity reports even with limited data. DREAM attains a BLEU-4 score of 0.241 on DeepEyeNet, a new state of the art, and generalizes well to the ROCO dataset.
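The summary does not say which loss the contrastive alignment uses. As a rough illustration only, the sketch below assumes an InfoNCE-style objective over cosine similarities, which is one common way to align two sets of embeddings. The function names, the `temperature` value, and the toy vectors are all hypothetical and are not taken from DREAM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(fused_batch, report_batch, temperature=0.1):
    """Assumed InfoNCE-style loss: fused[i] should match report[i]
    better than any other report in the batch."""
    losses = []
    for i, fused in enumerate(fused_batch):
        logits = [cosine(fused, rep) / temperature for rep in report_batch]
        # Numerically stable log-sum-exp over all candidate reports.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching report as the positive.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two fused image+keyword vectors.
fused = [[1.0, 0.0], [0.0, 1.0]]
reports_aligned = [[1.0, 0.0], [0.0, 1.0]]   # each report matches its image
reports_swapped = [[0.0, 1.0], [1.0, 0.0]]   # reports paired with the wrong image

loss_aligned = info_nce(fused, reports_aligned)
loss_swapped = info_nce(fused, reports_swapped)
```

Under this assumed objective, correctly paired embeddings yield a much smaller loss than mismatched ones, which is the property a contrastive alignment step relies on during training.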
📝 Abstract
Automating medical reports for retinal images requires a sophisticated blend of visual pattern recognition and deep clinical knowledge. Current Large Vision-Language Models (LVLMs) often struggle in specialized medical fields where data is scarce, leading to models that overfit and miss subtle but critical pathologies. To address this, we introduce DREAM (Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion), a novel framework for high-fidelity medical report generation that excels even with limited data. DREAM employs a unique two-stage fusion mechanism that intelligently integrates visual data with clinical keywords curated by ophthalmologists. First, the Abstractor module maps image and keyword features into a shared space, enhancing visual data with pathology-relevant insights. Next, the Adaptor performs adaptive multi-modal fusion, dynamically weighting the importance of each modality using learnable parameters to create a unified representation. To ensure the model's outputs are semantically grounded in clinical reality, a Contrastive Alignment module aligns these fused representations with ground-truth medical reports during training. By combining medical expertise with an efficient fusion strategy, DREAM sets a new state-of-the-art on the DeepEyeNet benchmark, achieving a BLEU-4 score of 0.241, and further demonstrates strong generalization to the ROCO dataset.
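The abstract describes the Adaptor as weighting each modality with learnable parameters but gives no formula. A minimal sketch of one possible reading follows, assuming two scalar logits that are normalized with a softmax and used to form a weighted sum of image and keyword features already projected into the shared space. The names `adaptive_fuse` and `modality_logits`, and the toy feature values, are hypothetical; the actual DREAM Adaptor may differ.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(image_feat, keyword_feat, modality_logits):
    """Assumed fusion rule: weighted sum of two same-length feature
    vectors, with weights given by a softmax over learnable logits."""
    if len(image_feat) != len(keyword_feat):
        raise ValueError("features must share one embedding dimension")
    w_image, w_keyword = softmax(modality_logits)
    return [w_image * i + w_keyword * k for i, k in zip(image_feat, keyword_feat)]


# Toy features (hypothetical), already in a shared 3-dimensional space.
image_feat = [0.2, 0.8, -0.4]
keyword_feat = [1.0, 0.0, 0.4]

# Equal logits give each modality a weight of 0.5.
fused_equal = adaptive_fuse(image_feat, keyword_feat, [0.0, 0.0])

# A large image logit makes the fused vector follow the image features.
fused_image_heavy = adaptive_fuse(image_feat, keyword_feat, [10.0, -10.0])
```

In training, the logits would be updated by gradient descent along with the rest of the model, so the balance between visual evidence and clinical keywords is learned rather than fixed by hand.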