STARC-9: A Large-scale Dataset for Multi-Class Tissue Classification for CRC Histopathology

📅 2025-10-31
🤖 AI Summary
Existing public colorectal cancer (CRC) histopathological image datasets suffer from insufficient morphological diversity, severe class imbalance, and inconsistent image quality, which limits the generalizability of multi-class tissue classification models. To address these limitations, the authors introduce STARC-9, a large-scale, high-quality, balanced nine-class CRC histopathology dataset with 70,000 tiles per class (630,000 in total). They also propose DeepCluster++, a framework that combines a histopathology-specific autoencoder, K-means clustering, and equal-frequency binning to ensure intra-class diversity and enable semi-automatic annotation, with expert pathologists verifying the selected tiles. This reduces manual labeling effort, and the framework can be extended to other histopathology datasets. Benchmarks with CNNs, transformers, and pathology-specific foundation models on classification and segmentation tasks show that models trained on STARC-9 generalize better than models trained on existing datasets, which underlines the effect of data quality and principled dataset construction on model performance.

📝 Abstract
Multi-class tissue-type classification of colorectal cancer (CRC) histopathologic images is a significant step in the development of downstream machine learning models for diagnosis and treatment planning. However, existing public CRC datasets often lack morphologic diversity, suffer from class imbalance, and contain low-quality image tiles, limiting model performance and generalizability. To address these issues, we introduce STARC-9 (STAnford coloRectal Cancer), a large-scale dataset for multi-class tissue classification. STARC-9 contains 630,000 hematoxylin and eosin-stained image tiles uniformly sampled across nine clinically relevant tissue classes (70,000 tiles per class) from 200 CRC patients at the Stanford University School of Medicine. The dataset was built using a novel framework, DeepCluster++, designed to ensure intra-class diversity and reduce manual curation. First, an encoder from a histopathology-specific autoencoder extracts feature vectors from tiles within each whole-slide image. Then, K-means clustering groups morphologically similar tiles, followed by equal-frequency binning to sample diverse morphologic patterns within each class. The selected tiles are subsequently verified by expert gastrointestinal pathologists to ensure accuracy. This semi-automated process significantly reduces manual effort while producing high-quality, diverse tiles. To evaluate STARC-9, we benchmarked convolutional neural networks, transformers, and pathology-specific foundation models on multi-class CRC tissue classification and segmentation tasks, showing superior generalizability compared to models trained on existing datasets. Although we demonstrate the utility of DeepCluster++ on CRC as a pilot use-case, it is a flexible framework that can be used for constructing high-quality datasets from large WSI repositories across a wide range of cancer and non-cancer applications.
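The clustering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' DeepCluster++ code: it is a plain Lloyd's-algorithm K-means in standard-library Python, and the toy 2-D `features` list is a hypothetical stand-in for the feature vectors that the autoencoder's encoder would produce for each tile.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical encoder outputs: two well-separated groups of 2-D features.
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(features, k=2)
```

Tiles that share a label are treated as morphologically similar. In practice the feature vectors would be high-dimensional and a library implementation of K-means would likely be used instead of this hand-written loop.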
Problem

Research questions and friction points this paper is trying to address.

Addressing limited morphologic diversity in colorectal cancer histopathology datasets
Solving class imbalance and low-quality image issues in CRC tissue classification
Developing a semi-automated framework for creating diverse, high-quality pathology datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

DeepCluster++ framework ensures intra-class diversity
K-means clustering groups morphologically similar image tiles
Equal-frequency binning samples diverse morphologic patterns
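Equal-frequency binning can be sketched as follows. The summary does not say which quantity DeepCluster++ bins, so this example assumes, purely for illustration, that tiles in one cluster are binned by their distance to the cluster centroid; the function names and the `distances` values are hypothetical.

```python
import random

def equal_frequency_bins(values, n_bins):
    """Split item indices into n_bins groups of (nearly) equal size, ordered by value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        end = (b + 1) * len(order) // n_bins
        bins.append(order[start:end])
    return bins

def sample_across_bins(values, n_bins, per_bin, seed=0):
    """Draw up to per_bin indices from every bin so all value ranges are represented."""
    rng = random.Random(seed)
    picked = []
    for members in equal_frequency_bins(values, n_bins):
        picked.extend(rng.sample(members, min(per_bin, len(members))))
    return picked

# Assumed input: distance of each tile in one cluster to that cluster's centroid.
distances = [0.1, 0.2, 0.25, 0.3, 0.9, 1.5, 1.6, 2.0, 4.0]
bins = equal_frequency_bins(distances, n_bins=3)
selected = sample_across_bins(distances, n_bins=3, per_bin=1)
```

Because every bin holds the same number of tiles, sampling a fixed count per bin picks up typical tiles near the centroid as well as atypical ones far from it, rather than letting the densest region of the cluster dominate the selection.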
Barathi Subramanian
Department of Pathology, Stanford University, USA
Rathinaraja Jeyaraj
Department of Pathology, Stanford University, USA
Mitchell Nevin Peterson
Department of Electrical Engineering, Stanford University, USA
Terry Guo
Department of Pathology, Stanford University, USA
Nigam Shah
Professor of Medicine, and Biomedical Data Science, Stanford University
ontology, data mining, medical informatics, biomedical informatics
Curtis Langlotz
Department of Radiology, Stanford University, USA
Andrew Y. Ng
Department of Computer Science, Stanford University, USA
Jeanne Shen
Department of Pathology, Stanford University, USA