🤖 AI Summary
Automated evaluation of radiology reports lacks a unified, open-source, and reproducible benchmark framework; existing metrics (e.g., BLEU, ROUGE, BERTScore, F1CheXbert, RaTEScore) are fragmented and non-standardized. Method: We propose RadEval, the first open-source framework integrating traditional n-gram matching, clinical concept alignment, and large language model–driven assessment, with support for multimodal imaging inputs and statistical significance testing. Contributions/Results: (1) A standardized evaluation taxonomy; (2) A lightweight, domain-adapted variant of the GREEN model; (3) A radiology-specific encoder pretrained on clinical text; (4) A high-quality dataset with over 450 expert-annotated, clinically significant error labels. Extensive validation across multiple public benchmarks demonstrates strong agreement between RadEval scores and radiologist judgments (Spearman’s ρ > 0.92), improving the robustness and reproducibility of radiology report evaluation.
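To make the framework concrete, a minimal usage sketch is shown below; the constructor flags, call signature, and result format are illustrative assumptions, not RadEval's confirmed public API.

```python
# Illustrative sketch only -- flag names, call signature, and result keys
# are assumptions for illustration, not the confirmed RadEval API.
from radeval import RadEval  # assumed import path

refs = ["No acute cardiopulmonary abnormality.",
        "Stable mild cardiomegaly without focal consolidation."]
hyps = ["No acute cardiopulmonary process.",
        "Mild cardiomegaly, unchanged. No consolidation."]

# Choose which metric families to compute (lexical, contextual, clinical).
evaluator = RadEval(do_bleu=True, do_bertscore=True,
                    do_radgraph=True, do_chexbert=True)

scores = evaluator(refs=refs, hyps=hyps)
print(scores)  # e.g., a dict mapping metric names to corpus-level scores
```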
📝 Abstract
We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder, demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.
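The statistical testing component can be pictured as a paired significance test over per-report scores. The self-contained sketch below uses a paired permutation (sign-flip) test in plain NumPy as one standard way to compare two report-generation systems on the same test set; it illustrates the idea rather than RadEval's exact implementation.

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-report scores."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    # Randomly flip the sign of each paired difference and recompute the mean;
    # the p-value is the fraction of resamples at least as extreme as observed.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    return float((permuted >= observed).mean())

# Placeholder per-report metric scores for two report-generation systems.
sys_a = np.array([0.71, 0.64, 0.80, 0.58, 0.69])
sys_b = np.array([0.66, 0.61, 0.77, 0.55, 0.70])
print(paired_permutation_test(sys_a, sys_b))
```

Sign-flipping the paired differences keeps the test distribution-free, which is useful because most report-level metric scores are bounded and far from normally distributed.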