RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)

📅 2026-01-21
🤖 AI Summary
This study addresses the lack of a high-quality, radiologist-annotated benchmark for evaluating multimodal large language models on chest radiograph interpretation. The authors propose an AI-assisted expert annotation pipeline in which GPT-4o extracts abnormal findings from radiology reports and a locally hosted Phi-4-Reasoning model maps those findings to 12 benchmark labels. A sampling algorithm driven by the AI-suggested labels then selects 1,000 studies for review by chest radiologists, from which 200 expert-validated chest radiographs are chosen: 100 released publicly and 100 held out by RSNA for independent model evaluation. The selection deliberately favors less common findings and studies with multiple labels, giving a benchmark intended to support clinically meaningful evaluation of multimodal models in cardiothoracic imaging.

📝 Abstract
Multimodal large language models have demonstrated performance comparable to that of radiology trainees on multiple-choice board-style exams. However, to develop clinically useful multimodal LLM tools, high-quality benchmarks curated by domain experts are essential. The purpose of this work was to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the Medical Imaging and Data Resource Center (MIDRC) were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, marking "Agree all", "Agree mostly", or "Disagree" to indicate their assessment of the correctness of the LLM-suggested labels. Each chest radiograph was evaluated by three experts. For 381 of the 1,000 radiographs, at least two radiologists selected "Agree all". From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset. The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created, with each chest radiograph verified by three radiologists; the released portion is publicly available at https://imaging.rsna.org. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.
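The consensus and selection steps described in the abstract can be illustrated with a small sketch. The abstract states only that a study was kept when at least two of its three reviewers chose "Agree all" and that the final selection prioritized less common or multiple finding labels; it does not give the scoring rule or the split procedure. The inverse-frequency score, the alternating split, the function names, and the placeholder labels (`label_a`, `label_b`, `label_c`) below are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter

AGREE_ALL = "Agree all"


def has_consensus(votes, min_agree=2):
    """Return True when at least `min_agree` reviewers chose 'Agree all'."""
    return sum(v == AGREE_ALL for v in votes) >= min_agree


def prioritize(studies, k):
    """Keep consensus studies and rank rarer / multi-label cases first.

    The scoring rule is hypothetical: each label contributes the inverse
    of its frequency among the consensus studies, so rare labels and
    studies carrying several labels both score higher.
    """
    kept = [s for s in studies if has_consensus(s["votes"])]
    freq = Counter(label for s in kept for label in s["labels"])

    def score(study):
        return sum(1.0 / freq[label] for label in study["labels"])

    # sorted() is stable, so ties keep their original order.
    return sorted(kept, key=score, reverse=True)[:k]


def split_alternating(selected):
    """Assumed split: alternate ranked studies between released and holdout."""
    return selected[0::2], selected[1::2]


# Toy data with placeholder label names (the 12 real labels are not listed here).
studies = [
    {"id": "s1", "labels": ["label_a"],
     "votes": ["Agree all", "Agree all", "Agree mostly"]},
    {"id": "s2", "labels": ["label_a", "label_b"],
     "votes": ["Agree all", "Agree all", "Agree all"]},
    {"id": "s3", "labels": ["label_c"],
     "votes": ["Agree all", "Disagree", "Agree mostly"]},
    {"id": "s4", "labels": ["label_a"],
     "votes": ["Agree mostly", "Agree all", "Agree all"]},
]

selected = prioritize(studies, k=3)
released, holdout = split_alternating(selected)
```

In this toy run, `s3` is dropped because only one reviewer chose "Agree all", and `s2` ranks first because it carries two labels, one of which is rare among the remaining studies.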
Problem

Research questions and friction points this paper is trying to address.

chest radiographs
multimodal large language models
benchmark dataset
radiologist labeling
cardiothoracic disease
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-assisted labeling
multimodal LLM benchmark
radiologist validation
chest radiograph dataset
expert-curated evaluation
Yishu Wei
Department of Radiology, Weill Cornell Medicine, New York, NY, USA
Adam E. Flanders
Department of Radiology, Thomas Jefferson University, Philadelphia, PA, USA
E. Colak
Department of Medical Imaging, St. Michael’s Hospital/Unity Health Toronto, University of Toronto, Toronto, ON, Canada
John Mongan
Department of Radiology and Biomedical Imaging; Division of Clinical Informatics and Digital Transformation, Department of Medicine, University of California, San Francisco, CA, USA
Luciano M Prevedello
Department of Radiology, Ohio State University Wexner Medical Center, OH, USA
Po-Hao Chen
Diagnostics Institute, Cleveland Clinic Foundation, Cleveland, OH, USA
Henrique Min Ho Lee
Hospital Israelita Albert Einstein, Av. Albert Einstein, 627, São Paulo 05652, Brazil
Gilberto Szarf
Adjunct Professor (Professor Adjunto), Department of Diagnostic Imaging, UNIFESP
Thoracic Radiology, Cardiac Magnetic Resonance Imaging, Coronary Artery Computed Tomography
Hamilton Shoji
Hospital Israelita Albert Einstein, Av. Albert Einstein, 627, São Paulo 05652, Brazil
Jason Sho
Radiological Society of North America, Oak Brook, IL, USA
Katherine P Andriole
Department of Radiology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
Tessa Cook
Department of Radiology, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
Lisa C. Adams
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, Technical University Munich, Munich, Germany
Linda C. Chu
Department of Radiology, Johns Hopkins University School of Medicine, Baltimore, MD, USA
Maggie Chung
Department of Radiology and Biomedical Imaging; Division of Clinical Informatics and Digital Transformation, Department of Medicine, University of California, San Francisco, CA, USA
Geraldine Brusca-Augello
Department of Radiology, Weill Cornell Medicine, New York, NY, USA
D. Deva
Department of Medical Imaging, St. Michael’s Hospital/Unity Health Toronto, University of Toronto, Toronto, ON, Canada
Navneet Singh
Trillium Health Partners, Department of Medical Imaging, Faculty of Medicine, University of Toronto
Felipe Sanchez Tijmes
Joint Department of Medical Imaging, Toronto General Hospital, University of Toronto, Toronto, ON, Canada
J. Alpert
Department of Radiology, Weill Cornell Medicine, New York, NY, USA
E. Nguyen
Joint Department of Medical Imaging, Toronto General Hospital, University of Toronto, Toronto, ON, Canada
Drew A. Torigian
Department of Radiology, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
Kate Hanneman
University Health Network, University of Toronto
Cardiac Imaging
Lauren K Groner
Department of Radiology, Weill Cornell Medicine, New York, NY, USA
Alexander Phan
Department of Radiology, Weill Cornell Medicine, New York, NY, USA
Ali Islam
St. Joseph’s Health Care London, Western University, London, ON
Matias F. Callejas
Department of Medical Imaging, St. Michael’s Hospital/Unity Health Toronto, University of Toronto, Toronto, ON, Canada
G. Teles
Hospital Israelita Albert Einstein, Av. Albert Einstein, 627, São Paulo 05652, Brazil
Faisal Jamal
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, Technical University Munich, Munich, Germany
Maryam Vazirabad
Radiological Society of North America, Oak Brook, IL, USA
Ali Tejani
Department of Materials Science and Engineering, Faculty of Engineering, University of Toronto
Hari Trivedi
Emory University
Deep Learning, Radiology, Mammography, AI, Natural Language Processing
Paulo Kuriki
Department of Radiology, UT Southwestern Medical Center, Dallas, TX, USA
Rajesh Bhayana
Department of Radiology, UT Southwestern Medical Center, Dallas, TX, USA
Elana T. Benishay
Department of Radiology, Weill Cornell Medicine, New York, NY, USA
Yi Lin
Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
Yifan Peng
Associate Professor at Weill Cornell Medicine
NLP, CV, machine learning
George Shih
Department of Radiology, Weill Cornell Medicine, New York, NY, USA