AI Summary
This work addresses the lack of a unified framework for evaluating the clinical validity and robustness of automatic metrics for CT imaging report generation. We propose the first comprehensive evaluation suite tailored to clinical feasibility, integrating three components: writing-style generalization via LLM-based rewriting, graded synthetic error injection, and analysis of expert disagreement cases. Systematically assessing eight mainstream metrics across seven large language models built on a CT-CLIP encoder, we correlate metric scores with expert human judgments via Spearman's ρ. Our experiments reveal that GREEN Score achieves the highest agreement with clinicians (ρ ≈ 0.70), while CRG shows a negative correlation and BERTScore-F1 is least sensitive to factual errors. These findings expose critical limitations of current metrics in clinical settings and establish a reproducible benchmark for medical natural language generation evaluation.
Abstract
In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active area of research, yet it remains challenging due to the lack of a unified, well-defined framework for assessing their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, the first unified metric-assessment framework, with three modules that determine the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder. Using this framework, we find that lexical NLG metrics are highly sensitive to stylistic variation; GREEN Score aligns best with expert judgments (Spearman's ρ ≈ 0.70), while CRG shows a negative correlation; and BERTScore-F1 is least sensitive to factual error injection. We will release the framework, code, and the allowable portion of the anonymized evaluation data (rephrased and error-injected CT reports) to facilitate reproducible benchmarking and future metric development.
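To make the Metrics-vs-Expert (MvE) step concrete, the sketch below computes Spearman's ρ between per-case metric scores and expert ratings in pure Python (ρ is the Pearson correlation of the two rank vectors). The data are hypothetical and the function names are our own, not from the paper's released code.

```python
def ranks(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-case metric scores vs. 1-5 expert ratings.
metric_scores = [0.91, 0.40, 0.75, 0.62, 0.88]
expert_ratings = [5, 1, 3, 2, 4]
print(round(spearman_rho(metric_scores, expert_ratings), 2))  # → 1.0 (perfectly monotone)
```

In practice, the same computation is available as `scipy.stats.spearmanr`; a rank-based correlation is used here precisely because expert ratings are ordinal, not interval-scaled.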