AI Summary
This work addresses the prevalent issue of factual hallucinations in multimodal large language models (MLLMs) for radiology report generation, where textual outputs often fail to strictly align with visual evidence. To mitigate this, the authors propose a self-consistent reinforcement learning framework featuring a novel two-stage "Reason-then-Summarize" architecture: the first stage generates fine-grained visual findings, and the second distills these into structured diagnostic labels. Logical consistency between stages is enforced through a multidimensional composite reward function. Leveraging the Group Relative Policy Optimization (GRPO) algorithm and an empirically optimized vision-language backbone, the method significantly reduces hallucination rates on the MIMIC-CXR dataset while achieving state-of-the-art performance on clinical utility metrics.
Abstract
Multimodal Large Language Models (MLLMs) have shown strong potential for radiology report generation, yet their clinical translation is hindered by architectural heterogeneity and the prevalence of factual hallucinations. Standard supervised fine-tuning often fails to strictly align linguistic outputs with visual evidence, while existing reinforcement learning approaches struggle with either prohibitive computational costs or limited exploration. To address these challenges, we propose a comprehensive framework for self-consistent radiology report generation. First, we conduct a systematic evaluation to identify optimal vision encoder and LLM backbone configurations for medical imaging. Building on this foundation, we introduce a novel "Reason-then-Summarize" architecture optimized via Group Relative Policy Optimization (GRPO). This framework restructures generation into two distinct components: a "think" block for detailed findings and an "answer" block for structured disease labels. By utilizing a multi-dimensional composite reward function, we explicitly penalize logical discrepancies between the generated narrative and the final diagnosis. Extensive experiments on the MIMIC-CXR benchmark demonstrate that our method achieves state-of-the-art performance in clinical efficacy metrics and significantly reduces hallucinations compared to strong supervised baselines.
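The two-block output, the consistency-aware composite reward, and GRPO's group-relative normalization can be sketched in miniature. Everything here is an illustrative assumption for exposition (the tag names, label set, reward weights, and exact normalization are not taken from the paper):

```python
import re
import statistics

# Illustrative assumptions only: tag names, label vocabulary, and reward
# weights are hypothetical, not the paper's exact design.
LABELS = {"cardiomegaly", "pleural effusion", "pneumothorax", "edema"}

def parse_blocks(text):
    """Split a generation into its think (findings) and answer (labels) blocks."""
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    if not (think and answer):
        return None, None  # malformed output forfeits all reward
    return think.group(1).strip(), answer.group(1).strip()

def composite_reward(generation, reference_labels,
                     w_format=0.2, w_acc=0.5, w_consist=0.3):
    """Format compliance + label accuracy + narrative/diagnosis consistency."""
    findings, answer = parse_blocks(generation)
    if findings is None:
        return 0.0
    predicted = {l for l in LABELS if l in answer.lower()}
    mentioned = {l for l in LABELS if l in findings.lower()}
    inter = predicted & reference_labels
    acc = 2 * len(inter) / ((len(predicted) + len(reference_labels)) or 1)
    # Self-consistency: every diagnosed label must appear in the narrative.
    consist = len(predicted & mentioned) / len(predicted) if predicted else 1.0
    return w_format + w_acc * acc + w_consist * consist

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sample's reward within its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

gen = ("<think>Cardiomegaly is present with a small left pleural effusion."
       "</think><answer>cardiomegaly, pleural effusion</answer>")
reward = composite_reward(gen, {"cardiomegaly", "pleural effusion"})
advantages = group_relative_advantages([reward, 0.4, 0.6, 0.4])
```

A report whose answer labels are all grounded in its own findings narrative earns the full consistency term, while an answer that names a disease the narrative never describes is penalized; the group-relative normalization then turns these scalar rewards into per-sample advantages without a learned value critic.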