Ran Score: a LLM-based Evaluation Score for Radiology Report Generation

📅 2026-03-24
🤖 AI Summary
Existing approaches to chest X-ray report generation and evaluation struggle to detect low-prevalence abnormalities and inadequately model critical clinical semantics such as negation and uncertainty. To address these limitations, this work proposes a multi-label finding extraction framework that integrates radiologist guidance with large language models to identify key imaging findings in free-text reports. Building on this framework, the authors introduce Ran Score, described as the first clinician-guided, finding-level automatic evaluation metric for chest X-ray reports. Through prompt optimization, the method raises the macro-averaged F1 score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, outperforms CheXbert by 15.7 percentage points on directly comparable labels, and generalizes well to the independent Chinese ChestX-CN validation cohort.

📝 Abstract
Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.
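The abstract reports a macro-averaged score over finding labels. As a minimal sketch of why macro-averaging matters for low-prevalence abnormalities, the Python below computes a per-label F1 and its unweighted mean over labels. It assumes the score is an F1 (as the AI summary states) and that each report is reduced to a set of positive findings; the function names, label names, and toy data are hypothetical, and this is not the paper's definition of Ran Score.

```python
from typing import List, Set


def per_label_f1(reference: List[Set[str]], predicted: List[Set[str]], label: str) -> float:
    """F1 for one finding label across paired reports (illustrative helper)."""
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        in_ref, in_pred = label in ref, label in pred
        if in_ref and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_ref:
            fn += 1
    denom = 2 * tp + fp + fn
    # Convention assumed here: a label with no positives anywhere scores 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(reference: List[Set[str]], predicted: List[Set[str]], labels: List[str]) -> float:
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    return sum(per_label_f1(reference, predicted, lab) for lab in labels) / len(labels)


# Toy data (hypothetical): positive findings per report.
labels = ["pneumothorax", "pleural effusion", "cardiomegaly"]
reference = [{"pleural effusion"}, {"pleural effusion", "cardiomegaly"}, {"pneumothorax"}, set()]
predicted = [{"pleural effusion"}, {"pleural effusion"}, set(), {"cardiomegaly"}]

score = macro_f1(reference, predicted, labels)
print(f"macro F1 = {score:.3f}")
```

In this toy case the common finding is extracted perfectly while both rarer findings are missed, so the macro average is pulled down to one third. A metric pooled over all label occurrences would hide that failure more.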
Problem

Research questions and friction points this paper is trying to address.

radiology report generation
low-prevalence abnormalities
clinical language
negation
automated evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ran Score
clinician-guided framework
large language models
finding-level evaluation
radiology report generation
Ran Zhang
School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China
Yucong Lin
School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China; Zhengzhou Research Institute, Beijing Institute of Technology, Zhengzhou 450003, China
Zhaoli Su
School of Medical Technology, Beijing Institute of Technology, Beijing 100081, China
Bowen Liu
Andreessen Horowitz, insitro, Stanford
Computational Chemistry, Drug Discovery, Graph Machine Learning
Danni Ai
Beijing Institute of Technology
Medical image processing, surgical navigation, virtual reality and augmented reality
Tianyu Fu
Ph.D. at Tsinghua University
Efficient AI, LLM, sparse computation
Deqiang Xiao
Assistant Professor, Beijing Institute of Technology (BIT)
Computer Aided Surgical Navigation/Planning, Medical Image Analysis, Computer Vision
Jingfan Fan
Beijing Institute of Technology
Medical Image Processing, Computer Vision
Yuanyuan Wang
School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China
Mingwei Gao
School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China
Yuwan Hu
Department of Radiology, China-Japan Friendship Hospital (Institute of Clinical Medical Sciences), Beijing 100029, China; Chinese Academy of Medical Science & Peking Union Medical College, Beijing 100730, China
Shuya Gao
Department of Radiology, Peking University China-Japan Friendship School of Clinical Medicine, Beijing 100029, China
Jingtao Li
Department of Gastroenterology, China-Japan Friendship Hospital, Beijing 100029, China; NHC Key Laboratory of Clinical Big Data Standardization & Integration, Beijing 100029, China
Jian Yang
Rutgers University, New Jersey Institute of Technology
Combinatorial optimization, inventory control, competitive pricing, game-theoretic applications
Hong Song
School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
Hongliang Sun
Department of Radiology, China-Japan Friendship Hospital, Beijing 100029, China