Ran Score: a LLM-based Evaluation Score for Radiology Report Generation

📅 2026-03-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to chest X-ray report generation and evaluation struggle to detect low-prevalence abnormalities and inadequately model clinically important language such as negation and uncertainty. To address these limitations, this work proposes a multi-label finding extraction framework that combines radiologist guidance with large language models to identify key imaging findings in free-text reports. Building on this framework, the authors introduce Ran Score, described as the first clinician-guided, finding-level automatic evaluation metric for chest X-ray reports. With clinician-guided prompt optimization, the framework raises the macro-averaged F1 score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds CheXbert by 15.7 percentage points on directly comparable labels, and generalizes robustly to the Chinese ChestX-CN validation cohort.

📝 Abstract

Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.
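
The abstract reports a macro-averaged score over extracted findings but does not spell out how it is computed. The following is a minimal sketch of one common reading, assuming binary per-finding labels (1 = present, 0 = absent or negated) and a standard macro-averaged F1. The finding names, the toy labels, and the helper functions are hypothetical illustrations, not the authors' label set or the definition of Ran Score.

```python
from typing import Dict, List

# Assumption: each report is reduced to binary labels per finding,
# 1 = finding asserted as present, 0 = absent or negated.
Labels = Dict[str, int]


def f1_for_finding(refs: List[Labels], preds: List[Labels], finding: str) -> float:
    """F1 for a single finding across all reports."""
    tp = fp = fn = 0
    for ref, pred in zip(refs, preds):
        r, p = ref.get(finding, 0), pred.get(finding, 0)
        if r and p:
            tp += 1
        elif p:
            fp += 1
        elif r:
            fn += 1
    denom = 2 * tp + fp + fn
    # Convention chosen here: F1 is 0.0 when the finding never occurs at all.
    return 2 * tp / denom if denom else 0.0


def macro_f1(refs: List[Labels], preds: List[Labels], findings: List[str]) -> float:
    """Unweighted mean of per-finding F1, so rare findings count as much as common ones."""
    return sum(f1_for_finding(refs, preds, f) for f in findings) / len(findings)


# Hypothetical finding names and toy labels for four reports.
findings = ["pneumothorax", "pleural_effusion", "cardiomegaly"]
refs = [
    {"pneumothorax": 1, "pleural_effusion": 0, "cardiomegaly": 1},
    {"pneumothorax": 0, "pleural_effusion": 1, "cardiomegaly": 1},
    {"pneumothorax": 0, "pleural_effusion": 1, "cardiomegaly": 0},
    {"pneumothorax": 0, "pleural_effusion": 0, "cardiomegaly": 0},
]
preds = [
    {"pneumothorax": 0, "pleural_effusion": 0, "cardiomegaly": 1},  # misses the rare finding
    {"pneumothorax": 0, "pleural_effusion": 1, "cardiomegaly": 1},
    {"pneumothorax": 0, "pleural_effusion": 1, "cardiomegaly": 1},  # one false positive
    {"pneumothorax": 0, "pleural_effusion": 0, "cardiomegaly": 0},
]

per_finding = {f: f1_for_finding(refs, preds, f) for f in findings}
score = macro_f1(refs, preds, findings)
print(per_finding)
print(round(score, 3))
```

In this toy case the single missed pneumothorax drives that finding's F1 to 0.0 and pulls the macro average down to about 0.6, even though most labels are correct. That sensitivity is why a macro-averaged, finding-level score is a natural fit for the low-prevalence abnormalities the abstract emphasizes.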

Problem

Research questions and friction points this paper is trying to address.

radiology report generation

low-prevalence abnormalities

clinical language

negation

automated evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Ran Score

clinician-guided framework

large language models