🤖 AI Summary
Existing NSCLC digital pathology datasets for cell detection and tissue segmentation are limited in scope, lack annotations of metastatic sites, and omit molecular information such as PD-L1 immunohistochemistry (IHC). To address these limitations, we introduce the IGNITE data toolkit, a publicly released, multi-center, multi-stain (H&E and PD-L1 IHC), multi-scanner dataset of annotated NSCLC whole-slide images, comprising 887 fully annotated regions of interest from 155 unique patients. Annotations cover three complementary tasks: multi-class semantic segmentation of tissue compartments in H&E slides (16 classes spanning primary and metastatic NSCLC), nuclei detection, and PD-L1-positive tumor cell detection in PD-L1 IHC slides. To the best of our knowledge, this is the first public NSCLC dataset with manual annotations of H&E in metastatic sites and of PD-L1 IHC. The dataset supports the development and benchmarking of NSCLC tissue segmentation and cell detection algorithms, and with them the computational analysis of the tumor immune microenvironment and quantitative biomarker assessment.
📝 Abstract
The tumor immune microenvironment (TIME) in non-small cell lung cancer (NSCLC) histopathology contains morphological and molecular characteristics predictive of immunotherapy response. Computational quantification of TIME characteristics, through tasks such as cell detection and tissue segmentation, can support biomarker development. However, currently available digital pathology datasets of NSCLC for developing cell detection or tissue segmentation algorithms are limited in scope, lack annotations of clinically prevalent metastatic sites, and omit molecular information such as PD-L1 immunohistochemistry (IHC). To fill this gap, we introduce the IGNITE data toolkit, a multi-stain, multi-centric, and multi-scanner dataset of annotated NSCLC whole-slide images. We publicly release 887 fully annotated regions of interest from 155 unique patients across three complementary tasks: (i) multi-class semantic segmentation of tissue compartments in H&E-stained slides, with 16 classes spanning primary and metastatic NSCLC, (ii) nuclei detection, and (iii) PD-L1-positive tumor cell detection in PD-L1 IHC slides. To the best of our knowledge, this is the first public NSCLC dataset with manual annotations of H&E in metastatic sites and PD-L1 IHC.