Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization

📅 2026-03-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the susceptibility of existing medical imaging report summarization models to visual noise when generating “impression” sections from “findings,” which often prevents them from outperforming purely text-based approaches. Challenging the assumption that more visual input is inherently better, this work proposes a multi-stage selective visual fusion framework. It leverages MedSAM2 for lung segmentation to localize pathological regions and integrates bidirectional cross-attention for multi-view fusion, Shapley-value-guided adaptive image patch clustering, and a hierarchical visual tokenization strategy within a Vision Transformer. By selectively utilizing only high-importance visual regions, the method achieves state-of-the-art performance on MIMIC-CXR (BLEU-4: 29.25%, ROUGE-L: 69.83%), significantly enhancing factual consistency, clinical relevance, and human expert ratings.
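The summary mentions bidirectional cross-attention for fusing multiple radiograph views (e.g., frontal and lateral). The paper does not publish code here, but the general mechanism can be sketched as each view's patch embeddings attending to the other view's, with residual fusion. All function names and the residual-addition fusion rule below are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over attention logits.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_feats, kv_feats):
    """Single-head cross-attention: queries from one view,
    keys/values from the other view (shared embedding dim)."""
    d = q_feats.shape[-1]
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d))
    return attn @ kv_feats

def bidirectional_fuse(frontal, lateral):
    """Each view attends to the other; here fused by residual
    addition (an assumed fusion rule for this sketch)."""
    return (frontal + cross_attend(frontal, lateral),
            lateral + cross_attend(lateral, frontal))

rng = np.random.default_rng(0)
frontal = rng.standard_normal((16, 32))  # 16 frontal-view patch embeddings, dim 32
lateral = rng.standard_normal((12, 32))  # 12 lateral-view patch embeddings
f_fused, l_fused = bidirectional_fuse(frontal, lateral)
```

Each fused sequence keeps its own view's length and dimension, so the result can feed directly into a downstream ViT encoder.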
📝 Abstract
Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS $\to$ IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS (Visual-Text Attention Summarizer), a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.
Problem

Research questions and friction points this paper is trying to address.

radiology summarization
multimodal learning
visual attention
medical image analysis
report generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

selective visual attention
multimodal summarization
Shapley-guided patch clustering
ViTAS
radiology report generation
🔎 Similar Papers
No similar papers found.
Mst. Fahmida Sultana Naznin
Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh
Adnan Ibney Faruq
Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh
Mushfiqur Rahman
Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh
Niloy Kumar Mondal
Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh
Md. Mehedi Hasan Shawon
Lecturer, BSRM School of Engineering, BRAC University
Health Informatics, Medical Imaging, Data Science, Artificial Intelligence, Explainable AI
Md Rakibul Hasan
PhD Candidate (Computing) at Curtin University || Senior Lecturer (on leave) at BRAC University
natural language processing, deep learning