RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automated radiology report evaluation methods lack clinical grounding, interpretability, and fine-grained analysis, often relying on black-box models or coarse overall scores, which limits their integration into real-world clinical workflows. This paper introduces the first explainable, fine-grained assessment framework targeting six clinically defined error types, producing both granular sub-scores and a natural-language justification for each. Methodologically, it builds on group-relative policy optimization and adds two components: dynamic sub-score weighting, which adapts error-type weights from live F1 statistics, and majority-guided advantage scaling, which modulates policy-gradient updates by prompt difficulty inferred from sub-score agreement. On the ReXVal benchmark, the approach outperforms existing offline metrics and matches GPT-4's assessment accuracy while offering superior transparency, strong clinical alignment, and significantly lower deployment overhead.

📝 Abstract
Evaluating automatically generated radiology reports remains a fundamental challenge due to the lack of clinically grounded, interpretable, and fine-grained metrics. Existing methods either produce coarse overall scores or rely on opaque black-box models, limiting their usefulness in real-world clinical workflows. We introduce RadReason, a novel evaluation framework for radiology reports that not only outputs fine-grained sub-scores across six clinically defined error types, but also produces human-readable justifications that explain the rationale behind each score. Our method builds on Group Relative Policy Optimization and incorporates two key innovations: (1) Sub-score Dynamic Weighting, which adaptively prioritizes clinically challenging error types based on live F1 statistics; and (2) Majority-Guided Advantage Scaling, which adjusts policy gradient updates based on prompt difficulty derived from sub-score agreement. Together, these components enable more stable optimization and better alignment with expert clinical judgment. Experiments on the ReXVal benchmark show that RadReason surpasses all prior offline metrics and achieves parity with GPT-4-based evaluations, while remaining explainable, cost-efficient, and suitable for clinical deployment. Code will be released upon publication.
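The abstract's first innovation, Sub-score Dynamic Weighting, reweights the six error-type sub-scores from live F1 statistics so that clinically harder error types count for more. The paper does not publish the exact formula, so the following is a minimal sketch under an assumed softmax-over-(1 − F1) parameterization; the function name and `temperature` parameter are hypothetical.

```python
import numpy as np

def dynamic_subscore_weights(f1_per_error_type, temperature=1.0):
    """Hypothetical sketch of Sub-score Dynamic Weighting: assign each
    of the six error types a weight that grows as its running F1 falls,
    so lower-F1 (clinically harder) types dominate the combined reward.
    Weights are normalized with a softmax over (1 - F1) / temperature."""
    f1 = np.asarray(f1_per_error_type, dtype=float)
    logits = (1.0 - f1) / temperature          # harder type -> larger logit
    w = np.exp(logits - logits.max())          # numerically stable softmax
    return w / w.sum()                         # weights sum to 1

# Six illustrative error types; the one with F1 = 0.4 gets the largest weight.
weights = dynamic_subscore_weights([0.9, 0.8, 0.4, 0.7, 0.85, 0.6])
```

Lowering `temperature` sharpens the weighting toward the single hardest error type; raising it flattens the weights toward uniform.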
Problem

Research questions and friction points this paper is trying to address.

Lack of clinically grounded, interpretable radiology report metrics
Existing methods produce coarse overall scores or rely on opaque black-box models
Need for fine-grained error scoring with human-readable justifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sub-score Dynamic Weighting adaptively prioritizes challenging error types from live F1 statistics
Majority-Guided Advantage Scaling adjusts policy updates by prompt difficulty from sub-score agreement
Group Relative Policy Optimization provides the stable optimization backbone
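The last two innovations above can be sketched together: GRPO standardizes rewards within a group of sampled responses to form advantages, and Majority-Guided Advantage Scaling then rescales those advantages by a difficulty signal derived from sub-score agreement. The paper does not give the exact scaling rule, so this is a minimal sketch assuming a simple `1 + (1 - agreement)` multiplier; the function name and that multiplier are assumptions.

```python
import numpy as np

def grpo_advantages(rewards, agreement, eps=1e-8):
    """Hypothetical sketch: group-relative advantages (as in GRPO) with
    majority-guided scaling. Rewards for a group of sampled responses to
    one prompt are standardized against the group mean and std; the
    result is then scaled up for hard prompts, taking low sub-score
    agreement within the group as the difficulty signal."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + eps)   # group-relative baseline
    difficulty = 1.0 - agreement             # low agreement -> harder prompt
    return adv * (1.0 + difficulty)          # amplified policy-gradient signal

# Four sampled responses to one prompt, moderate agreement among sub-scores.
adv = grpo_advantages([1.0, 2.0, 3.0, 4.0], agreement=0.5)
```

With `agreement = 1.0` this reduces to plain group-standardized GRPO advantages; as agreement drops, every advantage in the group is amplified, steering larger updates toward prompts the model finds ambiguous.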
👥 Authors
Yingshu Li, School of Electrical and Computer Engineering, University of Sydney, NSW 2006, Australia
Yunyi Liu, The University of Sydney (LLM, VQA, Visual Grounding, Report Generation, Medical Image)
Lingqiao Liu, Associate Professor at the University of Adelaide (computer vision, machine learning)
Lei Wang, School of Computing and Information Technology, University of Wollongong, NSW 2522, Australia
Luping Zhou, School of Electrical and Computer Engineering, University of Sydney (Medical Imaging, Computer Vision, Machine Learning)