CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automated metrics for evaluating radiology reports lack clinical grounding and struggle to accurately assess diagnostic correctness and patient safety. To address this gap, this work proposes the first clinical-guideline-based evaluation framework, which incorporates comprehensive clinical context (including patient age and indication for imaging) and introduces a fine-grained error taxonomy with eight attribute-level categories defined by cardiothoracic radiology experts. The framework further integrates a severity-weighting mechanism to prioritize clinically critical errors. Built upon a large language model augmented with clinical decision rules, the method is validated on three expert-annotated benchmarks: ReXVal, RadJudge, and RadPref. Its evaluations demonstrate strong alignment with radiologists' judgments (Kendall's τ = 0.61-0.71; Pearson's r = 0.71-0.84), significantly outperforming current automatic metrics.

📝 Abstract
We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's τ = 0.61-0.71; Pearson's r = 0.71-0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass-fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1-5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks (RadJudge and RadPref), and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.
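The severity-aware weighting described in the abstract can be sketched as follows. The four clinical significance levels come from the paper; the numeric weights and the scoring formula are hypothetical placeholders for illustration, not CRIMSON's actual values.

```python
# Illustrative sketch of severity-aware error weighting. The significance
# levels (urgent, actionable non-urgent, non-actionable, expected/benign)
# follow the paper's guideline; the weights below are hypothetical.
SEVERITY_WEIGHTS = {
    "urgent": 4.0,
    "actionable_non_urgent": 2.0,
    "non_actionable": 1.0,
    "expected_benign": 0.25,
}

def weighted_error_score(errors):
    """Sum severity weights over (category, significance) error pairs.

    `category` would be one of the taxonomy's error types (false finding,
    missing finding, or an attribute-level error such as location or
    measurement); only the significance level affects this toy score.
    """
    return sum(SEVERITY_WEIGHTS[significance] for _, significance in errors)

errors = [
    ("missing_finding", "urgent"),         # e.g. a missed pneumothorax
    ("location", "non_actionable"),
    ("false_finding", "expected_benign"),  # benign discrepancy, low weight
]
score = weighted_error_score(errors)  # 4.0 + 1.0 + 0.25 = 5.25
```

The point of such a scheme is that a single urgent miss outweighs several benign discrepancies, so the metric cannot be gamed by getting many clinically irrelevant findings right.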
Problem

Research questions and friction points this paper is trying to address.

radiology report evaluation
clinical significance
diagnostic correctness
patient safety
LLM-based metric
Innovation

Methods, ideas, or system contributions that make the work stand out.

clinically grounded evaluation
severity-aware weighting
error taxonomy
radiology report generation
LLM-based metric
Mohammed Baharoon
Harvard Medical School
Computer Vision · Multimodal Learning · Unsupervised Learning · Foundation Models
Thibault Heintz
Department of Radiation Oncology, Mass General Brigham, Boston, MA
Siavash Raissi
Department of Biomedical Informatics, Harvard Medical School, Boston, MA
Mahmoud Alabbad
King Fahad Hospital, Al-Ahsa Health Cluster, Al Hofuf, Saudi Arabia
Mona Alhammad
Ras-Tanura General Hospital, Ministry of Health, Eastern Province, Saudi Arabia
Hassan AlOmaish
Department of Medical Imaging, King Abdulaziz Medical City, Ministry of National Guard, Riyadh, Saudi Arabia
Sung Eun Kim
Department of Biomedical Informatics, Harvard Medical School, Boston, MA
Oishi Banerjee
PhD Student, Harvard University
AI for Medicine · AI for Healthcare
Pranav Rajpurkar
Department of Biomedical Informatics, Harvard Medical School, Boston, MA