🤖 AI Summary
Existing automated metrics for evaluating radiology reports lack clinical grounding and struggle to assess diagnostic correctness and patient safety. To address this gap, this work proposes the first clinical guideline-based evaluation framework, which incorporates comprehensive clinical context (including patient age and indication for imaging) and introduces a fine-grained error taxonomy with eight attribute-level error categories defined in collaboration with cardiothoracic radiology experts. The framework also applies severity weighting to prioritize clinically critical errors. Built on a large language model augmented with clinical decision rules, the method is validated on three expert-annotated benchmarks: ReXVal, RadJudge, and RadPref. Its evaluations align strongly with radiologists' judgments (Kendall's τ = 0.61–0.71; Pearson's r = 0.71–0.84), substantially outperforming existing automatic metrics.
📝 Abstract
We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign) based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's τ = 0.61–0.71; Pearson's r = 0.71–0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass/fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1–5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.
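To make the severity-aware weighting idea concrete, here is a minimal sketch of how errors tagged with the four clinical significance levels could be aggregated into a single penalty score. The weight values, class names, and function name are illustrative assumptions for exposition only; they are not CRIMSON's actual weights or API.

```python
from dataclasses import dataclass

# Hypothetical severity weights -- CRIMSON's actual weighting scheme is not
# reproduced here; these values are for illustration only.
SEVERITY_WEIGHTS = {
    "urgent": 4.0,
    "actionable_non_urgent": 2.0,
    "non_actionable": 1.0,
    "expected_benign": 0.25,
}

@dataclass
class ReportError:
    category: str   # e.g. "false_finding", "missing_finding", "location", ...
    severity: str   # one of the SEVERITY_WEIGHTS keys

def severity_weighted_score(errors: list[ReportError]) -> float:
    """Sum severity-weighted penalties over detected errors; lower is better."""
    return sum(SEVERITY_WEIGHTS[e.severity] for e in errors)

errors = [
    ReportError("missing_finding", "urgent"),          # weight 4.0
    ReportError("location", "non_actionable"),         # weight 1.0
    ReportError("false_finding", "expected_benign"),   # weight 0.25
]
print(severity_weighted_score(errors))  # → 5.25
```

Under this scheme, a single missed urgent finding outweighs several benign discrepancies, which mirrors the paper's goal of prioritizing clinically consequential mistakes.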