DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation

📅 2026-04-18

🤖 AI Summary

This work addresses the limitations of existing large vision-language models in retinal medical report generation: training data is scarce, models overfit, and subtle pathological features are hard to recognize. The authors propose DREAM, a framework that integrates retinal images with ophthalmologist-curated clinical keywords through a two-stage fusion mechanism. First, an Abstractor module maps image and keyword features into a shared semantic space to enhance the visual representations; then, an Adaptor module dynamically weights and fuses these features. A contrastive alignment training strategy is further introduced to keep generated reports semantically faithful to the ground-truth clinical reports. By combining expert knowledge guidance with adaptive multimodal fusion, DREAM targets high-fidelity report generation even under limited data conditions. The authors report a BLEU-4 score of 0.241 on DeepEyeNet, which they describe as a new state of the art, and strong generalization on the ROCO dataset.

📝 Abstract

Automating medical reports for retinal images requires a sophisticated blend of visual pattern recognition and deep clinical knowledge. Current Large Vision-Language Models (LVLMs) often struggle in specialized medical fields where data is scarce, leading to models that overfit and miss subtle but critical pathologies. To address this, we introduce DREAM (Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion), a novel framework for high-fidelity medical report generation that excels even with limited data. DREAM employs a unique two-stage fusion mechanism that intelligently integrates visual data with clinical keywords curated by ophthalmologists. First, the Abstractor module maps image and keyword features into a shared space, enhancing visual data with pathology-relevant insights. Next, the Adaptor performs adaptive multi-modal fusion, dynamically weighting the importance of each modality using learnable parameters to create a unified representation. To ensure the model's outputs are semantically grounded in clinical reality, a Contrastive Alignment module aligns these fused representations with ground-truth medical reports during training. By combining medical expertise with an efficient fusion strategy, DREAM sets a new state-of-the-art on the DeepEyeNet benchmark, achieving a BLEU-4 score of 0.241, and further demonstrates strong generalization to the ROCO dataset.
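The Adaptor's dynamic weighting can be pictured with a small sketch. The abstract does not give the fusion formula, so the version below assumes one learnable logit per modality, a softmax over those logits, and a weighted sum of the two feature vectors. The names `adaptive_fuse` and `modality_logits` are hypothetical, and plain Python lists stand in for the tensors a real model would use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(image_feat, keyword_feat, modality_logits):
    """Fuse two same-length feature vectors with softmax-normalized weights.

    Assumption: `modality_logits` holds two learnable scalars
    (image, keyword); this is an illustrative stand-in, not the
    paper's actual Adaptor.
    """
    if len(image_feat) != len(keyword_feat):
        raise ValueError("features must live in the same shared space")
    w_img, w_kw = softmax(modality_logits)
    return [w_img * a + w_kw * b for a, b in zip(image_feat, keyword_feat)]


# Toy features already projected into a shared 3-dimensional space.
image_feat = [0.2, -1.0, 0.5]
keyword_feat = [1.0, 0.0, -0.5]

# Equal logits weight both modalities equally.
fused_equal = adaptive_fuse(image_feat, keyword_feat, [0.0, 0.0])

# A larger keyword logit pulls the fused vector toward the keyword features.
fused_keyword_heavy = adaptive_fuse(image_feat, keyword_feat, [0.0, 3.0])
```

In training, the two logits would be updated by gradient descent along with the rest of the model, so the balance between image and keyword evidence is learned rather than fixed by hand.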

Problem

Research questions and friction points this paper is trying to address.

medical report generation

retinal image analysis

data scarcity

pathology detection

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive multi-modal fusion

clinical keyword integration

contrastive alignment
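
A minimal sketch of what contrastive alignment between fused representations and report embeddings might look like is given below. The page does not state the loss used in DREAM, so this assumes a one-directional InfoNCE objective with in-batch negatives and cosine similarity; `info_nce` and the temperature of 0.07 are hypothetical choices for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(fused, reports, temperature=0.07):
    """Mean cross-entropy of matching fused[i] to reports[i].

    Every other report in the batch acts as a negative. This is an
    assumed stand-in for the paper's Contrastive Alignment module.
    """
    f = [l2_normalize(v) for v in fused]
    r = [l2_normalize(v) for v in reports]
    losses = []
    for i, fi in enumerate(f):
        # Cosine similarity of sample i against every report, scaled.
        logits = [sum(a * b for a, b in zip(fi, rj)) / temperature for rj in r]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two samples in a 2-dimensional embedding space.
fused_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_reports = [[1.0, 0.0], [0.0, 1.0]]
swapped_reports = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(fused_batch, aligned_reports)
loss_swapped = info_nce(fused_batch, swapped_reports)
```

The loss is small when each fused vector is closest to its own report and large when the pairing is wrong, which is the pressure that pulls the fused representation toward the semantics of the ground-truth report.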