AI Summary
This work addresses the lack of a unified framework for evaluating the clinical validity and robustness of automatic metrics for CT imaging report generation. We propose the first comprehensive evaluation suite tailored to clinical feasibility, integrating three components: writing-style generalization via LLM-based rewriting, graded synthetic error injection, and analysis of expert disagreement cases. Systematically assessing eight mainstream metrics across seven large language models built on a CT-CLIP encoder, we correlate metric scores with expert human judgments via Spearman's ρ. Our experiments reveal that GREEN Score achieves the highest agreement with clinicians (ρ ≈ 0.70), while CRG shows a negative correlation and BERTScore-F1 is least sensitive to factual errors. These findings expose critical limitations of current metrics in clinical settings and establish a reproducible benchmark for medical natural language generation evaluation.
Abstract
In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active area of research, yet it remains challenging due to the lack of a unified, well-defined framework for assessing their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, the first unified metric-assessment framework, with three modules that determine the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder. Using this framework, we find that lexical NLG metrics are highly sensitive to stylistic variation; GREEN Score aligns best with expert judgments (Spearman's ρ ≈ 0.70), while CRG shows a negative correlation; and BERTScore-F1 is least sensitive to factual error injection. We will release the framework, code, and the allowable portion of the anonymized evaluation data (rephrased and error-injected CT reports) to facilitate reproducible benchmarking and future metric development.
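To make the Metrics-vs-Expert (MvE) step concrete, the sketch below computes Spearman's ρ between per-case metric scores and expert ratings in pure Python (ρ is the Pearson correlation of the two rank vectors). The data are hypothetical and the function names are our own, not from the paper's released code.

```python
def ranks(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-case metric scores vs. 1-5 expert ratings.
metric_scores = [0.91, 0.40, 0.75, 0.62, 0.88]
expert_ratings = [5, 1, 3, 2, 4]
print(round(spearman_rho(metric_scores, expert_ratings), 2))  # → 1.0 (perfectly monotone)
```

In practice, the same computation is available as `scipy.stats.spearmanr`; a rank-based correlation is used here precisely because expert ratings are ordinal, not interval-scaled.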