🤖 AI Summary
To address core challenges in cytology (pronounced staining heterogeneity, scarcity of annotated data, and poor cross-organ generalization), this work introduces the first self-supervised vision foundation model designed specifically for cytology. The authors adapt Vision Transformers (ViTs) to cytological analysis through a masked image modeling and self-distillation pretraining paradigm built on the iBOT framework, paired with a downstream adaptation method based on attention-guided multiple-instance learning. The model addresses the limitations of general-purpose pathology foundation models on cell-level tasks: on two of three downstream tasks, spanning breast cancer classification and cell-type identification, it achieves higher accuracy and F1 scores than both UNI (a tissue-pathology foundation model) and iBOT pretrained on ImageNet. Visualization analyses confirm that it attends precisely to critical morphological features, including nuclear shape and chromatin distribution, indicating strong interpretability and biological relevance.
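The iBOT-style pretraining step described above (a student network sees a masked view, a teacher maintained as an exponential moving average of the student sees the full view, and the student is trained to match the teacher's output distribution on the masked tokens) can be sketched roughly as follows. Everything here is an illustrative toy: the linear projections stand in for the actual ViT backbone and projection head, and the dimensions, temperatures, and masking ratio are assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, t):
    """Temperature-scaled softmax over the last axis."""
    e = np.exp((x - x.max(axis=-1, keepdims=True)) / t)
    return e / e.sum(axis=-1, keepdims=True)

K, D, C = 16, 32, 8                 # tokens, embed dim, prototype count (toy sizes)
W_student = rng.normal(size=(D, C)) * 0.1
W_teacher = W_student.copy()        # teacher starts as a copy of the student

tokens = rng.normal(size=(K, D))    # stand-in for patch-token embeddings
mask = np.zeros(K, dtype=bool)
mask[:6] = True                     # mask a fixed subset of tokens for the student
student_in = np.where(mask[:, None], 0.0, tokens)  # masked tokens replaced

p_teacher = softmax(tokens @ W_teacher, t=0.04)    # sharper teacher targets
p_student = softmax(student_in @ W_student, t=0.1)

# Cross-entropy on masked positions only: the masked-image-modeling part of iBOT.
loss = -(p_teacher[mask] * np.log(p_student[mask] + 1e-9)).sum(axis=-1).mean()

# Teacher parameters follow the student via exponential moving average.
m = 0.996
W_teacher = m * W_teacher + (1 - m) * W_student
```

In the real framework the same distillation loss is also applied to the [CLS] token across augmented views; this sketch keeps only the masked-token term to show the core mechanic.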
📝 Abstract
Cytology is essential for cancer diagnosis and screening due to its minimally invasive nature. However, developing robust deep learning models for digital cytology is challenging due to heterogeneity in sample staining and preparation methods, differences across organs, and the limited availability of large, diverse, annotated datasets. Developing a task-specific model for every cytology application is impractical, and non-cytology-specific foundation models struggle to generalize to tasks in this domain, where the emphasis is on cell morphology. To address these challenges, we introduce CytoFM, the first cytology self-supervised foundation model. Using iBOT, a self-supervised Vision Transformer (ViT) training framework that combines masked image modeling and self-distillation, we pretrain CytoFM on a diverse collection of cytology datasets to learn robust, transferable representations. We evaluate CytoFM on multiple downstream cytology tasks, including breast cancer classification and cell type identification, using an attention-based multiple instance learning framework. Our results demonstrate that CytoFM outperforms existing foundation models pretrained on histopathology (UNI) or natural images (iBOT-ImageNet) on two of three downstream tasks. Visualizations of the learned representations show that our model attends to cytologically relevant features. Despite a small pretraining dataset, CytoFM's promising results highlight the ability of task-agnostic pretraining approaches to learn robust, generalizable features from cytology data.
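The attention-based multiple instance learning (MIL) evaluation head mentioned above can be sketched as attention pooling in the style of Ilse et al. (2018): each cell or patch embedding in a slide-level "bag" receives a learned attention weight, and the bag representation is their weighted sum. All shapes and parameter names below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_mil_pool(H, V, w):
    """Attention-based MIL pooling.

    H: (K, D) instance embeddings for one bag (e.g. cells from one slide).
    V: (L, D) projection matrix, w: (L,) attention vector (both learned in practice).
    Returns the bag embedding (D,) and the attention weights (K,).
    """
    scores = np.tanh(H @ V.T) @ w            # unnormalized attention, shape (K,)
    scores = scores - scores.max()           # numerical stability before softmax
    a = np.exp(scores) / np.exp(scores).sum()
    z = a @ H                                # attention-weighted sum -> (D,)
    return z, a

# Toy example: 5 instance embeddings of dim 8, attention dim 4.
K, D, L = 5, 8, 4
H = rng.normal(size=(K, D))
V = rng.normal(size=(L, D))
w = rng.normal(size=L)

z, a = attention_mil_pool(H, V, w)
```

A bag-level classifier (e.g. a linear layer over `z`) then produces the slide-level prediction, and the weights `a` indicate which cells drove it, which is the source of the attention visualizations described above.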