AI Summary
Existing methods for medical image report generation struggle to achieve fine-grained spatial alignment between lesion locations and textual descriptions and lack effective means to evaluate spatial grounding capabilities. This work proposes a plug-and-play Discriminative Cue-Prompted Generation framework with Prompt Dropout (DCP-PD), which extracts discriminative cues from free-text reports to guide 3D CT report generation and incorporates a prompt dropout mechanism to prevent the model from relying on superficial shortcuts. The study further introduces, for the first time, a hierarchical, position-aware question-set protocol to directly assess pathology-to-location grounding ability. On the CT-RATE benchmark, the method achieves a macro F1 score of 0.603, representing a 20% relative improvement, and demonstrates substantial generalization gains on out-of-domain Rad-ChestCT data, where F1 rises from 0.266 to 0.503, an 89% relative increase.
Abstract
Vision--language models (VLMs) for radiology report generation (RRG) can produce long-form chest CT reports from volumetric scans and show strong potential to improve radiology workflow efficiency and consistency. However, existing methods face two key limitations: (i) training supervision is often coarse, aligning a whole CT volume with a full free-text report without explicit alignment for fine-grained attributes or pathology locations; and (ii) evaluation is typically holistic (lexical overlap, entity matching, or LLM-as-a-judge scores) and not diagnostic for spatial grounding. We propose \emph{Discriminative Cue-Prompting with Prompt Dropout (DCP-PD)}, a plug-and-play framework that distills fine-grained cues from free-text reports and uses them to guide report generation while mitigating shortcut reliance via prompt dropout. DCP-PD achieves state-of-the-art performance on CT-RATE, improving macro F1 from $0.501$ to $0.603$ (20% relative), and substantially boosts out-of-distribution performance on Rad-ChestCT, raising F1 from $0.266$ to $0.503$ (89% relative). Finally, we introduce a hierarchical, location-aware question-set protocol (presence $\rightarrow$ laterality $\rightarrow$ lobe) to directly assess pathology-location grounding, showing that fine-grained spatial localization remains challenging even for models that score highly on current benchmarks.
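The abstract does not spell out how prompt dropout is applied, so the following is only a minimal sketch of one plausible reading: during training, the cue prompt is withheld from a random fraction of samples so that the generator cannot rely on the cues alone. The function name `build_prompt`, the base instruction text, the cue format, and the choice to drop the whole cue set at once (rather than individual cues) are all assumptions made for illustration, not details taken from the paper.

```python
import random
from typing import List

# Hypothetical base instruction; the actual prompt wording is not given in the abstract.
BASE_INSTRUCTION = "Generate a radiology report for this chest CT."


def build_prompt(cues: List[str], drop_prob: float, rng: random.Random) -> str:
    """Build the text prompt for one training sample.

    With probability `drop_prob` the cues are withheld and only the base
    instruction is returned (assumed whole-prompt dropout). Otherwise the
    cues are appended to the base instruction.
    """
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must lie in [0, 1]")
    # rng.random() is in [0.0, 1.0), so drop_prob=1.0 always drops
    # and drop_prob=0.0 never drops.
    if not cues or rng.random() < drop_prob:
        return BASE_INSTRUCTION
    return BASE_INSTRUCTION + " Cues: " + "; ".join(cues)


if __name__ == "__main__":
    rng = random.Random(0)
    example_cues = ["pleural effusion: present", "laterality: left"]
    for _ in range(4):
        print(build_prompt(example_cues, drop_prob=0.5, rng=rng))
```

At inference time, one would presumably call the same function with `drop_prob=0.0` when cues are available. Dropping the cue set as a whole keeps the sketch simple; per-cue dropout would be an equally plausible variant.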