RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)

📅 2026-01-21

🤖 AI Summary

This study addresses the lack of a high-quality, radiologist-annotated benchmark for evaluating multimodal large language models on chest radiograph interpretation. The authors propose an AI-assisted expert annotation pipeline: GPT-4o extracts abnormal findings from radiology reports, and a locally hosted Phi-4-Reasoning model maps those findings to 12 standardized benchmark labels. A stratified sampling algorithm then selects 1,000 of the 13,735 studies for expert review, and each selected study is assessed by three radiologists. The result is a benchmark of 200 expert-validated chest radiographs, of which 100 are released publicly and 100 are held out for independent model evaluation by RSNA, with deliberate priority given to less common findings and multi-label cases. The resource is intended to support more clinically reliable evaluation of multimodal foundation models in cardiothoracic imaging.
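To make the idea of sampling on the basis of AI-suggested labels concrete, here is a minimal sketch of one possible label-stratified sampler. It is not the authors' algorithm, which is not specified here: the function name `stratified_sample`, the `per_label_quota` parameter, and the rare-labels-first ordering are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(studies, per_label_quota, seed=0):
    """Add up to `per_label_quota` new studies for each AI-suggested label.

    `studies` maps a study ID to the set of labels suggested for it.
    A study that carries several labels is selected at most once.
    (Hypothetical helper; not the sampling procedure used in the paper.)
    """
    rng = random.Random(seed)

    # Group study IDs by each label they carry.
    by_label = defaultdict(list)
    for study_id, labels in studies.items():
        for label in labels:
            by_label[label].append(study_id)

    selected = set()
    # Visit rare labels first so common labels do not crowd them out.
    for label in sorted(by_label, key=lambda name: (len(by_label[name]), name)):
        candidates = [s for s in by_label[label] if s not in selected]
        rng.shuffle(candidates)
        selected.update(candidates[:per_label_quota])
    return selected
```

Visiting rare labels first is one simple way to keep uncommon findings represented; a real pipeline would also need to account for the difficulty levels mentioned in the abstract, which this sketch ignores.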

📝 Abstract

Multimodal large language models (LLMs) have demonstrated performance comparable to that of radiology trainees on multiple-choice, board-style exams. However, developing clinically useful multimodal LLM tools requires high-quality benchmarks curated by domain experts. The purpose of this work was to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the Medical Imaging and Data Resource Center (MIDRC) were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, marking each study "Agree all", "Agree mostly", or "Disagree" to indicate their assessment of the correctness of the LLM-suggested labels. Each chest radiograph was evaluated by three experts. For 381 radiographs, at least two radiologists selected "Agree all". From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset. The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created and made publicly available at https://imaging.rsna.org, with each chest radiograph verified by three radiologists. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.
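The review and selection steps described in the abstract can be sketched roughly as follows. Only the "at least two of three readers chose Agree all" rule and the priority on less common or multiple labels come from the abstract. The function names, the ranking key, and the alternating released/holdout split are assumptions for illustration; the abstract does not say how the 200 studies were divided.

```python
from collections import Counter


def consensus_agree_all(votes, min_agree=2):
    """Return IDs of studies where at least `min_agree` readers chose 'Agree all'."""
    return [
        study_id
        for study_id, ratings in votes.items()
        if sum(1 for rating in ratings if rating == "Agree all") >= min_agree
    ]


def prioritize(study_ids, labels, n):
    """Rank studies so rarer labels, then more labels, come first; keep the top n.

    The ranking key is a hypothetical choice, not the paper's stated rule.
    """
    freq = Counter(label for sid in study_ids for label in labels[sid])

    def key(sid):
        rarest = min((freq[label] for label in labels[sid]), default=float("inf"))
        return (rarest, -len(labels[sid]), sid)

    return sorted(study_ids, key=key)[:n]


def split_release_holdout(ranked):
    """Alternate ranked studies between released and holdout sets (an assumption)."""
    return ranked[0::2], ranked[1::2]
```

With three ratings per study, `consensus_agree_all(votes)` corresponds to the filter that yielded 381 radiographs, and `prioritize(..., n=200)` followed by `split_release_holdout` would give two sets of 100 under these assumptions.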

Problem

Research questions and friction points this paper is trying to address.

chest radiographs

multimodal large language models

benchmark dataset

radiologist labeling

cardiothoracic disease

Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-assisted labeling

multimodal LLM benchmark

radiologist validation