PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation

📅 2024-11-08

🏛️ NEJM AI

📈 Citations: 5

✨ Influential: 0

🤖 AI Summary

Existing chest X-ray (CXR) datasets lack the manually curated annotations needed to train grounded radiology report generation (GRRG) models, which generate a report and also localize each finding on the image. PadChest-GR, derived from PadChest, is a bilingual (English and Spanish) dataset of 4,555 CXR studies (3,099 abnormal and 1,456 normal). Each study lists sentences describing individual present (positive) and absent (negative) findings, for a total of 7,037 positive and 3,422 negative finding sentences. Every positive finding sentence has up to two independent sets of bounding boxes drawn by different readers, plus categorical labels for finding type, location, and progression. The authors describe it as, to the best of their knowledge, the first manually curated dataset designed for training GRRG models. It is available on request.

📝 Abstract

Radiology report generation (RRG) aims to create free-text radiology reports from clinical imaging. Grounded radiology report generation (GRRG) extends RRG by including the localization of individual findings on the image. Currently, there are no manually annotated chest X-ray (CXR) datasets to train GRRG models. In this work, we present a dataset called PadChest-GR (Grounded-Reporting), derived from PadChest and aimed at training GRRG models for CXR images. We curate a public bilingual dataset of 4,555 CXR studies with grounded reports (3,099 abnormal and 1,456 normal), each containing complete lists of sentences describing individual present (positive) and absent (negative) findings in English and Spanish. In total, PadChest-GR contains 7,037 positive and 3,422 negative finding sentences. Every positive finding sentence is associated with up to two independent sets of bounding boxes labelled by different readers and has categorical labels for finding type, locations, and progression. To the best of our knowledge, PadChest-GR is the first manually curated dataset designed to train GRRG models for understanding and interpreting radiological images and generated text. By including detailed localization and comprehensive annotations of all clinically relevant findings, it provides a valuable resource for developing and evaluating GRRG models from CXR images. PadChest-GR can be downloaded on request from https://bimcv.cipf.es/bimcv-projects/padchest-gr/
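
To make the annotation structure in the abstract concrete, here is a minimal Python sketch of how one grounded study could be represented in memory. It is an illustration only, not the dataset's actual file format: the class names, field names, box coordinate convention, and the example sentences are all hypothetical, and the rule that a study counts as abnormal when it has at least one positive finding is an assumption rather than something the abstract states.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical convention: (x_min, y_min, x_max, y_max), normalized to [0, 1].
# The source does not specify a coordinate format.
Box = Tuple[float, float, float, float]


@dataclass
class FindingSentence:
    text_en: str                 # English sentence for one finding
    text_es: str                 # Spanish sentence for the same finding
    positive: bool               # True = finding present, False = finding absent
    # One inner list of boxes per reader; the abstract says "up to two" sets.
    reader_boxes: List[List[Box]] = field(default_factory=list)
    finding_type: Optional[str] = None
    locations: List[str] = field(default_factory=list)
    progression: Optional[str] = None

    def __post_init__(self) -> None:
        if len(self.reader_boxes) > 2:
            raise ValueError("at most two reader box sets per finding sentence")


@dataclass
class GroundedStudy:
    study_id: str
    sentences: List[FindingSentence]

    @property
    def is_abnormal(self) -> bool:
        # Assumption: abnormal means at least one positive finding sentence.
        return any(s.positive for s in self.sentences)


# Made-up example record (not taken from the dataset).
study = GroundedStudy(
    study_id="demo-001",
    sentences=[
        FindingSentence(
            text_en="Cardiomegaly.",
            text_es="Cardiomegalia.",
            positive=True,
            reader_boxes=[[(0.30, 0.40, 0.70, 0.80)], [(0.32, 0.41, 0.69, 0.79)]],
            finding_type="cardiomegaly",
            locations=["cardiac"],
        ),
        FindingSentence(
            text_en="No pleural effusion.",
            text_es="Sin derrame pleural.",
            positive=False,
        ),
    ],
)
```

Keeping the English and Spanish text on the same record, rather than in two parallel lists, is one simple way to keep the two languages aligned per finding.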

Problem

Research questions and friction points this paper is trying to address.

Lack of manually annotated chest X-ray datasets for grounded radiology report generation

Need for bilingual training data with localized findings in radiology images

Absence of comprehensive datasets with positive and negative finding annotations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual dataset with grounded reports

Manual annotations for positive and negative findings

Bounding boxes for localization of findings
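
Because a positive finding can carry boxes from two different readers, a natural question is how closely two boxes overlap. The sketch below computes intersection over union (IoU), a standard overlap measure for bounding boxes. It is a generic illustration and not a method described in the source; the `(x_min, y_min, x_max, y_max)` convention and the function name `box_iou` are assumptions.

```python
from typing import Tuple

# Hypothetical convention: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Return intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs give a union of zero; report no overlap.
    return inter / union if union > 0 else 0.0
```

An IoU of 1.0 means the two boxes coincide, and 0.0 means they do not overlap at all.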
