🤖 AI Summary
This study addresses the limited radiologist evaluation of existing chest X-ray (CXR) report generation models. The authors propose CXRMate-2, a model that combines structured multimodal conditioning with reinforcement learning, using a composite reward to improve semantic alignment between generated and radiologist-written reports. Across MIMIC-CXR, CheXpert Plus, and ReXgradient, CXRMate-2 achieves statistically significant improvements over strong benchmarks; on MIMIC-CXR it outperforms MedGemma 1.5 (4B) by 11.2% on GREEN and by 24.4% on RadGraph-XL. In a blinded, randomised retrospective evaluation of 120 studies from the MIMIC-CXR test set, three consultant radiologists judged the generated report acceptable (preferred or rated equal to the radiologist report) in 45% of ratings. For seven of the eight analysed findings, preference rates did not differ significantly between radiologist reports and acceptable generated reports. Radiologist reports were preferred mainly for their higher recall, while generated reports were often preferred for readability. The authors conclude that these results suggest a credible pathway toward prospective evaluation of CXR report generation in assistive roles.
📝 Abstract
Chest X-ray (CXR) radiology report generation (RRG) models have shown rapid progress, yet their clinical utility remains uncertain due to limited evaluation by radiologists. We present CXRMate-2, a state-of-the-art CXR RRG model that integrates structured multimodal conditioning and reinforcement learning with a composite reward for semantic alignment with radiologist reports. Across the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets, CXRMate-2 achieves statistically significant improvements over strong benchmarks, including gains of 11.2% and 24.4% in GREEN and RadGraph-XL, respectively, on MIMIC-CXR relative to MedGemma 1.5 (4B).
To directly compare CXRMate-2 against radiologist reporting, we conduct a blinded, randomised qualitative retrospective evaluation. Three consultant radiologists compare generated and radiologist reports across 120 studies from the MIMIC-CXR test set. Generated reports were deemed acceptable (defined as preferred or rated equally to radiologist reports) in 45% of ratings, with no statistically significant difference in preference rates between radiologist reports and acceptable generated reports for seven of the eight analysed findings. Preference for radiologist reports was driven primarily by higher recall, while generated reports were often preferred for readability.
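As a rough illustration of the acceptability definition above, the sketch below counts a rating as acceptable when the generated report is preferred or rated equal to the radiologist report. The label names, the `acceptable_rate` helper, and the toy ratings are assumptions made for this example; they are not the study's rating scheme or data.

```python
from collections import Counter

# Hypothetical rating labels (the abstract does not specify the rating scheme).
PREFER_GENERATED = "generated"
EQUAL = "equal"
PREFER_RADIOLOGIST = "radiologist"


def acceptable_rate(ratings):
    """Return the fraction of ratings in which the generated report was
    preferred or rated equal to the radiologist report."""
    if not ratings:
        raise ValueError("ratings must be non-empty")
    counts = Counter(ratings)
    acceptable = counts[PREFER_GENERATED] + counts[EQUAL]
    return acceptable / len(ratings)


# Toy ratings for illustration only; these are not the study's ratings.
toy_ratings = [PREFER_GENERATED, EQUAL, PREFER_RADIOLOGIST, PREFER_RADIOLOGIST]
print(acceptable_rate(toy_ratings))  # 0.5
```

Note that the rate is computed over individual ratings rather than over studies, matching the abstract's phrasing "45% of ratings".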
Together, these results suggest a credible pathway to clinically acceptable CXR RRG. Improvements in recall, alongside better detection of subtle findings (e.g., pulmonary congestion), are likely sufficient to achieve non-inferiority to radiologist reporting. With these targeted advances, CXR RRG systems may be ready for prospective evaluation in assistive roles within radiologist-led workflows.