🤖 AI Summary
Automated evaluation of radiology reports lacks a unified, open-source, and reproducible benchmark framework; existing metrics (e.g., BLEU, ROUGE, BERTScore, F1CheXbert, RaTEScore) are fragmented and non-standardized. Method: We propose RadEval, the first open-source framework integrating traditional n-gram matching, clinical concept alignment, and large language model–driven assessment, with support for multimodal imaging inputs and statistical significance testing. Contributions/Results: (1) A standardized evaluation taxonomy; (2) A lightweight, domain-adapted variant of the GREEN model; (3) A radiology-specific encoder pretrained on clinical text; (4) A high-quality dataset with over 450 expert-annotated, clinically significant error labels. Extensive validation across multiple public benchmarks demonstrates strong agreement between RadEval scores and radiologist judgments (Spearman’s ρ > 0.92), improving the robustness and reproducibility of radiology report evaluation.
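To make the framework concrete, a minimal usage sketch is shown below; the constructor flags, call signature, and result format are illustrative assumptions, not RadEval's confirmed public API.

```python
# Illustrative sketch only -- flag names, call signature, and result keys
# are assumptions for illustration, not the confirmed RadEval API.
from radeval import RadEval  # assumed import path

refs = ["No acute cardiopulmonary abnormality.",
        "Stable mild cardiomegaly without focal consolidation."]
hyps = ["No acute cardiopulmonary process.",
        "Mild cardiomegaly, unchanged. No consolidation."]

# Choose which metric families to compute (lexical, contextual, clinical).
evaluator = RadEval(do_bleu=True, do_bertscore=True,
                    do_radgraph=True, do_chexbert=True)

scores = evaluator(refs=refs, hyps=hyps)
print(scores)  # e.g., a dict mapping metric names to corpus-level scores
```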
📝 Abstract
We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder, demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.
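The statistical testing component can be pictured as a paired significance test over per-report scores. The self-contained sketch below uses a paired permutation (sign-flip) test in plain NumPy as one standard way to compare two report-generation systems on the same test set; it illustrates the idea rather than RadEval's exact implementation.

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-report scores."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    # Randomly flip the sign of each paired difference and recompute the mean;
    # the p-value is the fraction of resamples at least as extreme as observed.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    return float((permuted >= observed).mean())

# Placeholder per-report metric scores for two report-generation systems.
sys_a = np.array([0.71, 0.64, 0.80, 0.58, 0.69])
sys_b = np.array([0.66, 0.61, 0.77, 0.55, 0.70])
print(paired_permutation_test(sys_a, sys_b))
```

Sign-flipping the paired differences keeps the test distribution-free, which is useful because most report-level metric scores are bounded and far from normally distributed.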