🤖 AI Summary
Current chest X-ray report generation methods suffer from weakly disease-aware visual representations and insufficient vision–language alignment, causing them to miss critical pathological features and limiting clinical accuracy. To address this, we propose a two-stage disease-aware framework. In Stage 1, the model learns Disease-Aware Semantic Tokens (DASTs) tied to specific pathology categories via cross-attention and multi-label classification, while aligning vision and language representations through contrastive learning. In Stage 2, a Disease-Visual Attention Fusion (DVAF) module integrates the disease-aware representations with visual features, and a Dual-Modal Similarity Retrieval (DMSR) mechanism combines visual and disease-specific similarities to retrieve highly similar historical image–report pairs that guide generation. Evaluated on CheXpert Plus, IU X-ray, and MIMIC-CXR, the framework achieves state-of-the-art performance, with significant gains in pathology coverage, diagnostic accuracy, and linguistic fluency.
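The Stage-1 idea above, learnable disease tokens attending over visual patch features and supervised with multi-label classification, can be sketched as follows. This is a minimal illustration, not the paper's implementation; all shapes, names, and the single-head attention form are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical setup: 14 learnable disease tokens (one per pathology
# category) act as queries over 49 visual patch features (keys/values).
num_diseases, num_patches, dim = 14, 49, 32
rng = np.random.default_rng(0)
disease_tokens = rng.normal(size=(num_diseases, dim))  # DAST queries
patch_feats = rng.normal(size=(num_patches, dim))      # visual features

# Scaled dot-product cross-attention: each disease token pools the
# patches most relevant to its pathology.
attn = softmax(disease_tokens @ patch_feats.T / np.sqrt(dim))  # (14, 49)
disease_repr = attn @ patch_feats                              # (14, 32)

# Each disease-aware representation feeds a per-class logit for
# multi-label classification (sigmoid rather than softmax, since
# multiple findings can co-occur in one chest X-ray).
w = rng.normal(size=(num_diseases, dim))
logits = (disease_repr * w).sum(axis=1)    # one logit per pathology
probs = 1.0 / (1.0 + np.exp(-logits))
print(probs.shape)  # (14,)
```

In training, these per-class probabilities would be fit with a binary cross-entropy loss against the report's pathology labels, which is what ties each token to its disease concept.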
📝 Abstract
Radiology report generation from chest X-rays is an important task in artificial intelligence, with the potential to greatly reduce radiologists' workload and shorten patient wait times. Despite recent advances, existing approaches often lack sufficient disease awareness in their visual representations and adequate vision–language alignment for the specialized requirements of medical image analysis. As a result, these models tend to overlook critical pathological features on chest X-rays and struggle to generate clinically accurate reports. To address these limitations, we propose a novel dual-stage disease-aware framework for chest X-ray report generation. In Stage 1, our model learns Disease-Aware Semantic Tokens (DASTs) corresponding to specific pathology categories through cross-attention mechanisms and multi-label classification, while simultaneously aligning vision and language representations via contrastive learning. In Stage 2, we introduce a Disease-Visual Attention Fusion (DVAF) module to integrate disease-aware representations with visual features, along with a Dual-Modal Similarity Retrieval (DMSR) mechanism that combines visual and disease-specific similarities to retrieve relevant exemplars, providing contextual guidance during report generation. Extensive experiments on benchmark datasets (CheXpert Plus, IU X-ray, and MIMIC-CXR) demonstrate that our disease-aware framework achieves state-of-the-art performance in chest X-ray report generation, with significant improvements in clinical accuracy and linguistic quality.
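The Stage-2 retrieval step, combining visual and disease-specific similarities to pick exemplar image–report pairs, can be sketched like this. The fusion weight `alpha`, the cosine metric, and all names here are assumptions for illustration, not the paper's actual DMSR formulation.

```python
import numpy as np

def cosine_sim(query, bank):
    """Cosine similarity between a query vector and each row of a bank."""
    q = query / (np.linalg.norm(query) + 1e-8)
    b = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    return b @ q

def dmsr_retrieve(query_visual, query_disease, bank_visual, bank_disease,
                  alpha=0.5, top_k=3):
    """Rank stored exemplars by a fused visual + disease similarity.

    alpha trades off the visual branch against the disease branch
    (an assumed hyperparameter); returns indices of the top-k matches.
    """
    sim_v = cosine_sim(query_visual, bank_visual)    # visual similarity
    sim_d = cosine_sim(query_disease, bank_disease)  # disease similarity
    fused = alpha * sim_v + (1.0 - alpha) * sim_d
    return np.argsort(-fused)[:top_k]

# Toy usage: 5 stored exemplars with 8-dim visual features and
# 14-dim disease-probability vectors; query exemplar 2 against the bank.
rng = np.random.default_rng(0)
bank_v = rng.normal(size=(5, 8))
bank_d = rng.uniform(size=(5, 14))
idx = dmsr_retrieve(bank_v[2], bank_d[2], bank_v, bank_d)
print(idx)  # exemplar 2 matches itself exactly, so it ranks first
```

The retrieved exemplars' reports would then be fed to the generator as contextual guidance, which is what the abstract means by enhancing semantic consistency during decoding.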