Calibrated Confidence Expression for Radiology Report Generation

📅 2026-03-31
🤖 AI Summary
This work addresses the overconfidence of large vision-language models (LVLMs) in radiology report generation: without reliable, interpretable confidence estimates, radiologists cannot tell which outputs need thorough review. The authors propose ConRad, a reinforcement learning framework that fine-tunes medical LVLMs to produce calibrated, verbalized confidence estimates alongside radiology reports. ConRad uses reward functions based on the logarithmic scoring rule and trains with the GRPO algorithm in two settings: a single report-level confidence score, and a sentence-level variant that assigns a confidence to each claim. Experiments show that ConRad substantially improves calibration and outperforms competing methods, and a clinical evaluation finds that its report-level scores align well with clinicians' judgment. Flagging full reports or low-confidence statements for targeted review can support safer clinical integration of AI-assisted report generation.
📝 Abstract
Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single report-level confidence score and a sentence-level variant assigning a confidence to each claim. Both are trained using the GRPO algorithm with reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experimentally, ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation, we show that ConRad's report-level scores are well aligned with clinicians' judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI assistance for report generation.
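To illustrate why a reward based on the logarithmic scoring rule incentivizes truthful self-assessment, here is a minimal Python sketch. It is not the authors' implementation: the binary notion of correctness, the function names `log_score_reward` and `expected_reward`, and the clipping constant `EPS` are all assumptions made for illustration. The sketch checks numerically that the expected reward is highest when the stated confidence equals the true probability of being correct.

```python
import math

# Hypothetical clipping constant (an assumption, not from the paper):
# keeps log() finite when the stated confidence is exactly 0 or 1.
EPS = 1e-6


def log_score_reward(confidence: float, correct: bool, eps: float = EPS) -> float:
    """Logarithmic scoring rule for one binary outcome.

    Returns log(p) if the statement was correct and log(1 - p) otherwise,
    where p is the verbalized confidence clipped to [eps, 1 - eps].
    """
    p = min(max(confidence, eps), 1.0 - eps)
    return math.log(p) if correct else math.log(1.0 - p)


def expected_reward(confidence: float, true_accuracy: float) -> float:
    """Expected log score when the statement is correct with probability true_accuracy."""
    return (
        true_accuracy * log_score_reward(confidence, True)
        + (1.0 - true_accuracy) * log_score_reward(confidence, False)
    )


# Illustrative check: a model that is right 70% of the time earns the
# highest expected reward by stating a confidence of 0.70.
true_accuracy = 0.7
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda c: expected_reward(c, true_accuracy))
print(f"confidence with highest expected reward: {best:.2f}")
```

Because the expected reward falls off on both sides of the true accuracy, overstating and understating confidence are both penalized, which is the property the abstract refers to as optimal calibration under reward maximization.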
Problem

Research questions and friction points this paper is trying to address.

confidence calibration
radiology report generation
large vision-language models
clinical safety
verbalized confidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence calibration
radiology report generation
large vision-language models
reinforcement learning
verbalized confidence
David Bani-Harouni
Technical University of Munich
Chantal Pellegrini
Technical University of Munich
Deep Learning, Computer Vision, Medical Imaging, Natural Language Processing
Julian Lüers
Computer Aided Medical Procedures, Technical University of Munich, Germany
Su Hwan Kim
Department of Diagnostic and Interventional Radiology, TUM Klinikum rechts der Isar, Germany; Department of Diagnostic and Interventional Neuroradiology, TUM Klinikum rechts der Isar, Germany
Markus Baalmann
Department of Diagnostic and Interventional Radiology and Nuclear Medicine, University Medical Center Hamburg-Eppendorf, Germany
Benedikt Wiestler
Munich Center for Machine Learning (MCML), Germany; Department of Diagnostic and Interventional Neuroradiology, TUM Klinikum rechts der Isar, Germany; AI for Image-Guided Diagnosis and Therapy, Technical University of Munich, Germany
Rickmer Braren
Technical University Munich
Radiology, Quantitative Image Analysis, Artificial Intelligence, Oncologic Imaging, Pancreatic Cancer
Nassir Navab
Professor of Computer Science, Technische Universität München
Matthias Keicher
Technische Universität München