SurGen: 1020 H&E-stained Whole Slide Images With Survival and Genetic Markers

📅 2025-02-07
🤖 AI Summary
Problem: Computational pathology and precision oncology in colorectal cancer (CRC) need high-quality, multimodal, publicly available datasets. Method: We introduce SurGen, a publicly accessible CRC dataset of 1,020 H&E-stained whole-slide images from 843 cases, annotated with KRAS/NRAS/BRAF mutation status and mismatch repair (MMR) status, with survival data for 426 cases. As a proof of concept, we train a machine learning model to predict MMR status from the whole-slide images. Contribution/Results: The model reaches a test AUROC of 0.8316, which suggests that H&E histology carries information about MMR status. The dataset links whole-slide images with genetic and survival annotations for CRC, supporting biomarker discovery, prognostic modeling, and machine learning research in pathology.

📝 Abstract
$\textbf{Background}$: Cancer remains one of the leading causes of morbidity and mortality worldwide. Comprehensive datasets that combine histopathological images with genetic and survival data across various tumour sites are essential for advancing computational pathology and personalised medicine. $\textbf{Results}$: We present SurGen, a dataset comprising 1,020 H&E-stained whole slide images (WSIs) from 843 colorectal cancer cases. The dataset includes detailed annotations for key genetic mutations (KRAS, NRAS, BRAF) and mismatch repair status, as well as survival data for 426 cases. To demonstrate SurGen's practical utility, we conducted a proof-of-concept machine learning experiment predicting mismatch repair status from the WSIs, achieving a test AUROC of 0.8316. These preliminary results underscore the dataset's potential to facilitate research in biomarker discovery, prognostic modelling, and advanced machine learning applications in colorectal cancer. $\textbf{Conclusions}$: SurGen offers a valuable resource for the scientific community, enabling studies that require high-quality WSIs linked with comprehensive clinical and genetic information on colorectal cancer. Our initial findings affirm the dataset's capacity to advance diagnostic precision and foster the development of personalised treatment strategies in colorectal oncology. Data available online at https://doi.org/10.6019/S-BIAD1285.
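The abstract reports its result as a test AUROC. As a reminder of what that metric measures, the following is a minimal sketch of AUROC computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `auroc` and the toy labels and scores are illustrative assumptions; they are not SurGen data and this is not the authors' evaluation code.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison (Mann-Whitney U), ties count as 0.5.

    labels: iterable of 0/1 ground-truth values (1 = positive class)
    scores: iterable of predicted scores, higher means more likely positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie contributes half a win
    return wins / (len(pos) * len(neg))


# Toy example (made-up values, not from the dataset).
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)
```

In the toy example, 8 of the 9 positive-negative pairs are ranked correctly, so the value is 8/9. The quadratic loop is fine for illustration; for large test sets a rank-based implementation from an established library would be the usual choice.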
Problem

Research questions and friction points this paper is trying to address.

Combines histopathological images with genetic data
Predicts mismatch repair status from WSIs
Facilitates biomarker discovery in colorectal cancer
Innovation

Methods, ideas, or system contributions that make the work stand out.

H&E-stained whole slide images
Genetic mutations and survival data
Machine learning for mismatch repair prediction
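Because the 1,020 slides come from 843 cases, some cases contribute more than one slide. One common precaution when training a slide-level predictor on such data is to split at the case level, so that slides from the same case never appear in both training and test sets. The sketch below illustrates that idea under stated assumptions: the slide and case identifiers, the function name `split_by_case`, and the split fraction are all hypothetical, and the summary does not say how the authors partitioned their data.

```python
import random


def split_by_case(slide_to_case, test_fraction=0.2, seed=0):
    """Assign slides to train/test so that each case lands on one side only.

    slide_to_case: dict mapping a slide identifier to its case identifier
    test_fraction: approximate fraction of cases (not slides) held out
    """
    cases = sorted(set(slide_to_case.values()))
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(cases)

    n_test = max(1, round(len(cases) * test_fraction))
    test_cases = set(cases[:n_test])

    split = {"train": [], "test": []}
    for slide, case in sorted(slide_to_case.items()):
        side = "test" if case in test_cases else "train"
        split[side].append(slide)
    return split


# Hypothetical identifiers for illustration only.
slides = {
    "slide_a": "case_1",
    "slide_b": "case_1",   # second slide from the same case
    "slide_c": "case_2",
    "slide_d": "case_3",
    "slide_e": "case_4",
    "slide_f": "case_5",
}
split = split_by_case(slides, test_fraction=0.4)
```

Splitting on cases rather than slides keeps the reported test metric from being inflated by near-duplicate tissue from one patient appearing on both sides.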