🤖 AI Summary
Existing public colorectal cancer (CRC) histopathology image datasets often lack morphologic diversity, suffer from class imbalance, and contain low-quality tiles, which limits the generalizability of multi-class tissue classification models. To address these limitations, we introduce STARC-9, a large-scale, balanced nine-class CRC histopathology dataset of 630,000 hematoxylin and eosin-stained tiles (70,000 per class) drawn from 200 patients. We further propose DeepCluster++, a framework that combines a histopathology-specific autoencoder, K-means clustering, and equal-frequency binning to ensure intra-class diversity; the selected tiles are then verified by expert gastrointestinal pathologists, so annotation is semi-automated and manual curation effort is reduced. The framework is designed to extend to other cancer and non-cancer histopathology datasets. Benchmarks with convolutional neural networks, transformers, and pathology-specific foundation models on classification and segmentation tasks show that models trained on STARC-9 generalize better than models trained on existing datasets. These results support the view that data quality and principled dataset construction have a substantial effect on model performance.
📝 Abstract
Multi-class tissue-type classification of colorectal cancer (CRC) histopathologic images is a significant step in the development of downstream machine learning models for diagnosis and treatment planning. However, existing public CRC datasets often lack morphologic diversity, suffer from class imbalance, and contain low-quality image tiles, limiting model performance and generalizability. To address these issues, we introduce STARC-9 (STAnford coloRectal Cancer), a large-scale dataset for multi-class tissue classification. STARC-9 contains 630,000 hematoxylin and eosin-stained image tiles uniformly sampled across nine clinically relevant tissue classes (70,000 tiles per class) from 200 CRC patients at the Stanford University School of Medicine. The dataset was built using a novel framework, DeepCluster++, designed to ensure intra-class diversity and reduce manual curation. First, an encoder from a histopathology-specific autoencoder extracts feature vectors from tiles within each whole-slide image. Then, K-means clustering groups morphologically similar tiles, followed by equal-frequency binning to sample diverse morphologic patterns within each class. The selected tiles are subsequently verified by expert gastrointestinal pathologists to ensure accuracy. This semi-automated process significantly reduces manual effort while producing high-quality, diverse tiles. To evaluate STARC-9, we benchmarked convolutional neural networks, transformers, and pathology-specific foundation models on multi-class CRC tissue classification and segmentation tasks, showing superior generalizability compared to models trained on existing datasets. Although we demonstrate the utility of DeepCluster++ on CRC as a pilot use-case, it is a flexible framework that can be used for constructing high-quality datasets from large WSI repositories across a wide range of cancer and non-cancer applications.
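The sampling stage described in the abstract (cluster the feature vectors, then use equal-frequency binning to draw varied tiles) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which quantity is binned, so the code assumes tiles are binned by their distance to the cluster centroid. The function names (`kmeans`, `equal_frequency_bins`, `select_diverse_tiles`) and all parameter values are hypothetical. The feature vectors would come from the autoencoder's encoder, which is omitted here. The tiny pure-Python K-means keeps the example dependency-free; in practice a library implementation such as scikit-learn's `KMeans` would normally take its place.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means on lists of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]

    def assign():
        # Label each point with the index of its nearest centroid.
        return [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign()


def equal_frequency_bins(values, n_bins):
    """Return one bin index per value so that bins hold near-equal counts."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def select_diverse_tiles(features, k=3, n_bins=4, per_bin=2, seed=0):
    """Pick tile indices spread over clusters and over distance bins.

    Assumption (not stated in the abstract): within each cluster, tiles are
    binned by distance to the centroid, and a few are drawn from every bin.
    """
    rng = random.Random(seed)
    centroids, labels = kmeans(features, k, seed=seed)
    selected = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue
        dists = [math.dist(features[i], centroids[c]) for i in members]
        bins = equal_frequency_bins(dists, n_bins)
        for b in range(n_bins):
            in_bin = [i for i, bb in zip(members, bins) if bb == b]
            selected.extend(rng.sample(in_bin, min(per_bin, len(in_bin))))
    return sorted(selected)


if __name__ == "__main__":
    # Toy 2-D "feature vectors" standing in for encoder outputs.
    toy_rng = random.Random(1)
    centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    feats = [[toy_rng.gauss(cx, 0.5), toy_rng.gauss(cy, 0.5)]
             for cx, cy in centers for _ in range(40)]
    print(select_diverse_tiles(feats))
```

Drawing from every bin, rather than uniformly from the whole cluster, means both typical tiles (near the centroid) and atypical ones (far from it) reach the candidate set. Under the pipeline described above, that candidate set would then go to pathologists for verification.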