🤖 AI Summary
Current automated evaluation metrics for radiology reports lack granularity and interpretability, and so fail to capture clinically meaningful nuances. To address this, we propose a clinically driven, tabular evaluation framework that assesses report quality at the attribute level across six dimensions: lesion presence and five key clinical attributes (onset, change, severity, anatomical localization, and clinical recommendation), enabling multi-dimensional alignment between candidate and reference reports. We introduce CLEAR-Bench, the first expert-annotated benchmark curated by consensus among five board-certified radiologists. Our framework integrates rule- and model-based attribute extraction, knowledge-guided structured comparison, and a multi-attribute weighted consistency scoring mechanism. On CLEAR-Bench, our automated evaluation achieves a Pearson correlation of 0.89 with physician ratings, significantly outperforming conventional text-similarity metrics, and delivers both high clinical fidelity and strong interpretability.
📝 Abstract
Existing metrics often lack the granularity and interpretability to capture nuanced clinical differences between candidate and ground-truth radiology reports, resulting in suboptimal evaluation. We introduce a Clinically-grounded tabular framework with Expert-curated labels and Attribute-level comparison for Radiology report evaluation (CLEAR). CLEAR not only examines whether a report accurately identifies the presence or absence of medical conditions, but also assesses whether it precisely describes each positively identified condition across five key attributes: first occurrence, change, severity, descriptive location, and recommendation. Compared with prior work, CLEAR's multi-dimensional, attribute-level outputs enable a more comprehensive and clinically interpretable evaluation of report quality. Additionally, to measure the clinical alignment of CLEAR, we collaborate with five board-certified radiologists to develop CLEAR-Bench, a dataset of 100 chest X-ray reports from MIMIC-CXR, annotated across 6 curated attributes and 13 CheXpert conditions. Our experiments show that CLEAR achieves high accuracy in extracting clinical attributes and provides automated metrics that are strongly aligned with clinical judgment.
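To make the attribute-level comparison concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each report has already been converted into a table of condition presence plus attribute values for positive conditions. The names `ReportTable` and `compare_reports`, the attribute keys, the example values, and the exact-match scoring rule are all hypothetical choices made for illustration; the abstract does not specify how CLEAR scores each attribute.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical keys for the five attributes named in the abstract.
ATTRIBUTES = (
    "first_occurrence",
    "change",
    "severity",
    "descriptive_location",
    "recommendation",
)


@dataclass
class ReportTable:
    """Tabular view of one report (an assumed representation)."""
    # condition name -> True if reported present, False if absent
    presence: Dict[str, bool]
    # condition name -> {attribute name: value}, for positive conditions only
    attributes: Dict[str, Dict[str, str]] = field(default_factory=dict)


def compare_reports(candidate: ReportTable,
                    reference: ReportTable) -> Dict[str, Optional[float]]:
    """Score presence agreement, then per-attribute exact-match agreement.

    Attributes are compared only for conditions that both reports mark as
    present, and only where the reference provides a value. Exact string
    match is a simplification chosen for this sketch.
    """
    conditions = sorted(reference.presence)
    hits = sum(
        candidate.presence.get(c, False) == reference.presence[c]
        for c in conditions
    )
    result: Dict[str, Optional[float]] = {
        "presence_accuracy": hits / len(conditions)
    }

    shared = [
        c for c in conditions
        if reference.presence[c] and candidate.presence.get(c, False)
    ]
    for attr in ATTRIBUTES:
        pairs = [
            (candidate.attributes.get(c, {}).get(attr),
             reference.attributes.get(c, {}).get(attr))
            for c in shared
        ]
        # Keep only conditions where the reference states this attribute.
        pairs = [(cand, ref) for cand, ref in pairs if ref is not None]
        if pairs:
            result[attr] = sum(cand == ref for cand, ref in pairs) / len(pairs)
        else:
            result[attr] = None  # nothing to compare for this attribute
    return result


# Toy example with made-up values (not taken from CLEAR-Bench).
reference = ReportTable(
    presence={"Edema": True, "Pneumothorax": False, "Pleural Effusion": True},
    attributes={
        "Edema": {"severity": "mild", "change": "worsening"},
        "Pleural Effusion": {"severity": "small", "descriptive_location": "left"},
    },
)
candidate = ReportTable(
    presence={"Edema": True, "Pneumothorax": False, "Pleural Effusion": False},
    attributes={"Edema": {"severity": "mild", "change": "stable"}},
)

scores = compare_reports(candidate, reference)
print(scores)
```

In this toy case the candidate misses the pleural effusion, so presence agreement is 2 of 3 conditions; for the one shared positive condition, severity matches and change does not. Reporting each attribute separately, rather than collapsing everything into one similarity score, is what lets a reader see which clinical detail a candidate report got wrong.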