🤖 AI Summary
This study systematically investigates the representational gap between pathology-specific foundation models (e.g., UNI, Virchow2, Prov-GigaPath) and general-purpose vision models (ImageNet/LVD-pretrained ViTs) on cell instance segmentation and classification. We propose a frozen-encoder architecture whose decoder fuses patch embeddings from multiple encoder depths, jointly generating semantic and distance maps for end-to-end cell-level segmentation and type classification. We provide the first quantitative evaluation of these performance disparities across PanNuke, CoNIC, and our newly introduced CytoDArk0 dataset. Pathology-specific models yield substantial gains: +4.2% in cell-detection F1-score and +3.8% in segmentation mAP, with particularly pronounced improvements for rare cell types. Our core contributions are twofold: (1) empirical evidence of the critical role of domain knowledge in fine-grained cellular analysis, and (2) a novel cross-depth embedding-fusion decoding paradigm that advances the clinical adaptation of medical imaging foundation models.
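The frozen-encoder, multi-depth fusion design summarized above can be sketched roughly as follows. This is a toy PyTorch illustration, not the authors' code: the miniature encoder, the tap depths, the upsampling path, and all dimensions are hypothetical stand-ins for the real ViT backbones and decoder.

```python
import torch
import torch.nn as nn

class ToyViTEncoder(nn.Module):
    """Stand-in for a frozen pre-trained ViT backbone (e.g., UNI or Virchow2).
    Returns patch embeddings tapped at several depths; all sizes are toy values."""
    def __init__(self, dim=64, depth=8, patch=16, taps=(1, 3, 5, 7)):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )
        self.taps = set(taps)

    def forward(self, x):
        t = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim) tokens
        feats = []
        for i, blk in enumerate(self.blocks):
            t = blk(t)
            if i in self.taps:        # collect embeddings at several depths
                feats.append(t)
        return feats

class FusionDecoder(nn.Module):
    """Trainable decoder: reshapes each tapped embedding back to a 2-D grid,
    concatenates them (the multi-depth fusion), upsamples, and predicts a
    semantic (cell-type) map and a distance map from two heads."""
    def __init__(self, dim=64, n_taps=4, n_classes=5, grid=8):
        super().__init__()
        self.grid = grid
        self.fuse = nn.Conv2d(dim * n_taps, dim, kernel_size=1)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(dim, dim, 2, stride=2), nn.ReLU(),
        )
        self.semantic_head = nn.Conv2d(dim, n_classes, 1)  # cell-type logits
        self.distance_head = nn.Conv2d(dim, 1, 1)          # distance-to-center map

    def forward(self, feats):
        g = self.grid
        grids = [f.transpose(1, 2).reshape(f.shape[0], -1, g, g) for f in feats]
        h = self.up(self.fuse(torch.cat(grids, dim=1)))
        return self.semantic_head(h), self.distance_head(h)

encoder, decoder = ToyViTEncoder(), FusionDecoder()
for p in encoder.parameters():
    p.requires_grad = False  # encoder stays frozen; only the decoder is trained
encoder.eval()
with torch.no_grad():
    feats = encoder(torch.randn(1, 3, 128, 128))
semantic, distance = decoder(feats)
print(semantic.shape, distance.shape)
```

The key design choice mirrored here is that the encoder is a fixed feature extractor, so any performance difference between backbones reflects the quality of their pre-trained representations rather than fine-tuning.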
📝 Abstract
Recent advances in foundation models have transformed computer vision, driving significant performance improvements across diverse domains, including digital histopathology. However, the advantages of domain-specific histopathology foundation models over general-purpose models for specialized tasks such as cell analysis remain underexplored. This study investigates the representation learning gap between the two categories by analyzing multi-level patch embeddings applied to cell instance segmentation and classification. We implement an encoder-decoder architecture with a fixed decoder and interchangeable encoders: convolutional, vision transformer (ViT), and hybrid encoders pre-trained on ImageNet-22K or LVD-142M represent general-purpose foundation models, and are compared against ViT encoders from the recently released UNI, Virchow2, and Prov-GigaPath foundation models, trained on patches extracted from hundreds of thousands of histopathology whole-slide images. The decoder integrates patch embeddings from different encoder depths via skip connections to generate semantic and distance maps, which are then post-processed to produce instance segmentation masks, where each label corresponds to an individual cell, and to perform cell-type classification. All encoders remain frozen during training so that their pre-trained feature extraction capabilities can be assessed. Using the PanNuke and CoNIC histopathology datasets, and the newly introduced Nissl-stained CytoDArk0 dataset for brain cytoarchitecture studies, we evaluate instance-level detection, segmentation accuracy, and cell-type classification. This study provides insights into the comparative strengths and limitations of general-purpose vs. histopathology foundation models, offering guidance for model selection in cell-focused histopathology and brain cytoarchitecture analysis workflows.
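Turning a semantic map and a distance map into per-cell instance labels, as the abstract describes, is typically done with a marker-controlled watershed. The NumPy/SciPy sketch below illustrates one such post-processing scheme; the thresholds, the seeded-watershed variant, and the helper function name are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from scipy import ndimage as ndi

def instances_from_maps(semantic_prob, distance, fg_thresh=0.5, seed_thresh=0.6):
    """Hypothetical post-processing: foreground from the semantic map, seeds
    from high-distance cell interiors, then each seed is grown back over the
    foreground with a seeded watershed to separate touching cells."""
    fg = semantic_prob > fg_thresh
    seeds, n_seeds = ndi.label(distance > seed_thresh)   # one seed per cell core
    # watershed_ift needs uint8/uint16 input; flood the inverted distance map
    inv = ((1.0 - distance) * 255).astype(np.uint16)
    markers = seeds.astype(np.int32)
    markers[~fg] = -1            # negative marker = background
    labels = ndi.watershed_ift(inv, markers)
    labels[labels == -1] = 0     # map background back to 0
    return labels, n_seeds

# Synthetic demo: two nearby cells encoded as Gaussian distance peaks.
yy, xx = np.mgrid[0:40, 0:40]
d1 = np.exp(-((yy - 12) ** 2 + (xx - 12) ** 2) / 30.0)
d2 = np.exp(-((yy - 28) ** 2 + (xx - 28) ** 2) / 30.0)
distance = np.maximum(d1, d2)
semantic = (distance > 0.1).astype(float)
labels, n_cells = instances_from_maps(semantic, distance)
print(n_cells)  # → 2 separate instances
```

The distance map is what makes instance separation possible: adjacent cells that merge into one blob in the semantic map still produce two distinct interior peaks, which become two watershed seeds.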