🤖 AI Summary
This work addresses a weakness of existing radiology report generation methods at inference time: default decoding often fails to select the clinically strongest report among the candidates a model can produce. The authors propose a decoder-agnostic, inference-time selection framework that samples multiple candidate reports and chooses among them with a clinical consensus mechanism, which combines text-based utilities with a utility computed from image–report multimodal embeddings. This measures agreement among candidates beyond surface-level textual similarity. Evaluated across three datasets and multiple radiology-oriented multimodal large language models, the method consistently improves clinically relevant metrics, outperforming both single-path decoding and generic Best-of-N strategies.
📝 Abstract
Radiology report generation (RRG) is commonly formulated as a single-path generation task, where a multimodal large language model (MLLM) produces one decoded report as the final output. While recent progress has largely been driven by scaling training data, model capacity, and retrieval mechanisms, improving report quality at inference time remains underexplored. In this work, we observe that fixed radiology MLLMs often generate clinically stronger reports elsewhere in their candidate pool than the one selected by default decoding, suggesting that inference-time decision making remains an overlooked bottleneck. To address this, we propose Clinical Consensus Selection (CCS), a decoder-agnostic inference-time selection framework that samples multiple candidate reports and selects the one with the highest clinical consensus across the rollout pool. CCS unifies text-based utilities with a radiology-adapted utility computed by an image–report-trained multimodal embedder, which measures candidate agreement beyond surface-level textual similarity. Across three datasets and multiple radiology MLLMs, CCS consistently improves inference-time performance over single-path decoding and generic Best-of-N baselines, with particularly clear gains on clinical metrics. Further analysis shows that image-grounded utility forms a selection axis distinct from textual consensus and that substantial headroom remains for improving RRG at inference time.
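The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the utilities, how they are combined, or how consensus is aggregated. The sketch assumes that consensus is the mean pairwise utility of a candidate against the rest of the pool, uses a toy token-overlap utility as a stand-in for a real text metric, and combines utilities with a hypothetical weighted average. The names `jaccard_utility`, `combine_utilities`, `consensus_scores`, and `select_by_consensus` are all illustrative. An embedding-based utility (for example, cosine similarity between report embeddings from an image–report-trained model) could be passed in as another `Utility`.

```python
from typing import Callable, Sequence

# A utility scores how much two candidate reports agree (higher = more agreement).
Utility = Callable[[str, str], float]


def jaccard_utility(a: str, b: str) -> float:
    """Toy text utility: token-set overlap. A stand-in, not a clinical metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def combine_utilities(utilities: Sequence[Utility],
                      weights: Sequence[float]) -> Utility:
    """Hypothetical combiner: weighted average of several utilities (assumption)."""
    if len(utilities) != len(weights) or not utilities:
        raise ValueError("need one weight per utility")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    def combined(a: str, b: str) -> float:
        return sum(w * u(a, b) for u, w in zip(utilities, weights)) / total

    return combined


def consensus_scores(candidates: Sequence[str], utility: Utility) -> list[float]:
    """Mean utility of each candidate against every other candidate in the pool."""
    n = len(candidates)
    if n == 1:
        return [0.0]
    scores = []
    for i, cand in enumerate(candidates):
        total = sum(utility(cand, other)
                    for j, other in enumerate(candidates) if j != i)
        scores.append(total / (n - 1))
    return scores


def select_by_consensus(candidates: Sequence[str],
                        utility: Utility = jaccard_utility) -> str:
    """Return the candidate with the highest consensus score (first one on ties)."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    scores = consensus_scores(candidates, utility)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


# Usage: a made-up pool of sampled reports for one study.
pool = [
    "no acute cardiopulmonary abnormality",
    "no acute cardiopulmonary process",
    "no acute abnormality",
    "large right pleural effusion",
]
print(select_by_consensus(pool))
```

With this toy pool, the outlier report receives a consensus score of zero and the first report is selected because it overlaps most with the others. Unlike a Best-of-N scheme that scores each candidate in isolation, this selection depends only on agreement within the pool, so a utility that reflects clinical content rather than word overlap would be needed in practice.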