🤖 AI Summary
To automate the generation of structured, interpretable clinical descriptions from ulcerative colitis (UC) endoscopic images, this work proposes a lesion-aware vision-language fusion framework. It uses ResNet as the visual backbone, combines Grad-CAM heatmaps with CBAM's channel-spatial attention to sharpen lesion localization, and injects clinical metadata (Mayo Endoscopic Subscore (MES), bleeding, and erosion) into the T5 decoder as natural-language prompts, enabling joint optimization of report generation and MES classification. The key contribution is being the first to jointly embed interpretable visual attention and domain-specific clinical priors in a single multimodal generative pipeline, improving description accuracy, structural consistency, and MES classification performance (average +8.2% over baselines). The resulting system supports clinically compliant, fully automated endoscopy reporting with high reliability and strong interpretability.
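The metadata-injection step can be sketched as rendering structured fields into a prompt string prepended to the decoder input. A minimal sketch, assuming a hypothetical prompt template (the paper's exact wording is not specified):

```python
def build_prompt(mes: int, bleeding: bool, erosion: bool) -> str:
    """Render clinical metadata as a natural-language prompt for the T5 decoder.

    The field names follow the paper (MES, bleeding, erosion); the template
    itself is illustrative, not the authors' exact format.
    """
    parts = [f"MES score: {mes}."]
    parts.append("Bleeding present." if bleeding else "No bleeding.")
    parts.append("Erosion present." if erosion else "No erosion.")
    # Task prefix in the T5 style, e.g. "describe endoscopy: ..."
    return "describe endoscopy: " + " ".join(parts)

print(build_prompt(2, True, False))
# describe endoscopy: MES score: 2. Bleeding present. No erosion.
```

In a T5 pipeline this string would be tokenized and concatenated with the visual features before decoding, so the generated report is conditioned on both modalities.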
📝 Abstract
We present a lesion-aware image captioning framework for ulcerative colitis (UC). The model integrates ResNet embeddings, Grad-CAM heatmaps, and CBAM-enhanced attention with a T5 decoder. Clinical metadata (MES 0–3, vascular pattern, bleeding, erythema, friability, ulceration) is injected as natural-language prompts to guide caption generation. The system produces structured, interpretable descriptions aligned with clinical practice and provides MES classification and lesion tags. Compared with baselines, our approach improves caption quality and MES classification accuracy, supporting reliable endoscopic reporting.
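The CBAM component applies channel attention followed by spatial attention to a convolutional feature map. A minimal NumPy sketch under simplifying assumptions: the channel-MLP weights are passed in (learned in the real module), and a sum stands in for CBAM's learned 7×7 conv over the pooled spatial maps:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(feat, w1, w2):
    """CBAM-style channel-then-spatial attention on feat of shape (C, H, W).

    w1: (C//r, C) and w2: (C, C//r) are the shared channel-MLP weights
    (random below; learned in the real module).
    """
    # Channel attention: shared ReLU MLP over avg- and max-pooled descriptors.
    avg = feat.mean(axis=(1, 2))                  # (C,)
    mx = feat.max(axis=(1, 2))                    # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    ch_att = sigmoid(mlp(avg) + mlp(mx))          # (C,), values in (0, 1)
    feat = feat * ch_att[:, None, None]
    # Spatial attention: pool over channels, gate each location.
    # (Real CBAM fuses the two maps with a learned 7x7 conv; summing is a stand-in.)
    s_avg = feat.mean(axis=0)                     # (H, W)
    s_max = feat.max(axis=0)                      # (H, W)
    sp_att = sigmoid(s_avg + s_max)               # (H, W)
    return feat * sp_att[None, :, :]

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = cbam(x, w1, w2)
print(y.shape)  # attention reweights activations but preserves the feature shape
```

Because both attention maps lie in (0, 1), the module only scales activations down, emphasizing lesion-relevant channels and locations without changing the feature map's shape.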