Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three challenges of cell-level dense prediction in computational pathology: fine-grained histological structure, strong domain shifts, and costly dense annotation. The authors argue that Vision Transformers (ViTs) make these harder, because patch tokenization can disrupt spatial continuity and weaken the local morphological detail that cell-level prediction needs. They propose CMD, a self-supervised convolutional generative pretraining framework built on a fully convolutional ConvNeXt-UNet backbone. CMD performs masked-diffusion pretraining directly in pixel space and incorporates features from frozen pathology foundation models through adaptive normalization. According to the reported experiments, CMD consistently outperforms existing ViT-based pathology foundation models across multiple dense prediction tasks and surpasses state-of-the-art end-to-end segmentation methods while fine-tuning only a small number of task-specific parameters, with the largest gains under limited annotation. The results suggest that purely convolutional architectures can serve as competitive pathology foundation models.
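To make the adaptive normalization idea concrete, here is a minimal sketch under stated assumptions, not the paper's implementation. It assumes a common form of conditional normalization: the backbone features are normalized, then scaled and shifted by values predicted from a conditioning vector, which here stands in for the frozen foundation model features. The function name `adaptive_norm`, the weight matrices `w_gamma` and `w_beta`, and the `(1 + gamma)` parameterization are all hypothetical choices for illustration.

```python
import math


def adaptive_norm(x, cond, w_gamma, w_beta, eps=1e-5):
    """Normalize a feature vector, then modulate it with a scale and shift
    predicted from a conditioning vector.

    x        : list of C backbone feature values (one per channel)
    cond     : list of D conditioning values (assumed: frozen model features)
    w_gamma  : C x D weights of a hypothetical linear layer predicting scale
    w_beta   : C x D weights of a hypothetical linear layer predicting shift
    """
    # Plain normalization over the channel dimension (zero mean, unit variance).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    x_hat = [(v - mean) / math.sqrt(var + eps) for v in x]

    # Per-channel scale and shift as linear projections of the condition.
    gamma = [sum(w * c for w, c in zip(row, cond)) for row in w_gamma]
    beta = [sum(w * c for w, c in zip(row, cond)) for row in w_beta]

    # With all-zero weights this reduces to ordinary normalization.
    return [(1.0 + g) * xh + b for g, xh, b in zip(gamma, x_hat, beta)]
```

With zero-initialized `w_gamma` and `w_beta`, the conditioning has no effect at the start of training and the layer behaves like an ordinary normalization layer; that initialization is a design choice assumed here, not something the summary states.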

📝 Abstract

Cell-level dense prediction is central to computational pathology, but remains challenging due to fine-grained histological structures, strong domain shifts, and costly dense annotations. Existing ViT-based pathology foundation models rely on patch tokenization, which can disrupt spatial continuity and weaken local morphological details needed for cell-level prediction. To address this, we propose Masked-Diffusion Convolutional Foundation Models, termed ConvNeXt Masked-Diffusion (CMD), a self-supervised convolutional generative pretraining framework for dense pathology representation learning. CMD uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and incorporates frozen pathology foundation model features through adaptive normalization. Experimental results demonstrate that CMD consistently outperforms existing ViT-based pathology foundation models and even surpasses state-of-the-art end-to-end segmentation methods while fine-tuning only a small number of task-specific parameters across multiple pathology dense prediction tasks. The advantage is particularly pronounced under limited annotation settings, where CMD exhibits stronger robustness and generalization ability. Our findings suggest that purely convolutional architectures can also serve as competitive pathology foundation models for cell-level dense prediction, achieving leading performance within the current ViT-dominated paradigm and providing a scalable, high-performance solution that better preserves histological structural priors for fine-grained pathology understanding.
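The abstract does not spell out how masking and diffusion are combined, so the following is only an illustrative sketch of one plausible corruption step for masked-diffusion pretraining in pixel space. It assumes block-wise masking and a standard Gaussian forward process, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, applied only to masked pixels while visible pixels stay clean. The function name `masked_diffusion_corrupt` and the parameters `patch`, `mask_ratio`, and `alpha_bar` are hypothetical.

```python
import math
import random


def masked_diffusion_corrupt(image, patch, mask_ratio, alpha_bar, rng):
    """Noise a random subset of square blocks of a single-channel image.

    image      : H x W list of lists of floats (pixel space)
    patch      : side length of each square masking block (assumption)
    mask_ratio : fraction of blocks to corrupt
    alpha_bar  : cumulative signal-retention factor of the diffusion step
    rng        : random.Random instance, for reproducibility

    Returns (noisy image, binary mask, sampled noise). A denoiser would be
    trained to recover the noise or the clean pixels where mask == 1.
    """
    h, w = len(image), len(image[0])
    cells = [(r, c) for r in range(h // patch) for c in range(w // patch)]
    masked_cells = set(rng.sample(cells, round(mask_ratio * len(cells))))

    noisy = [row[:] for row in image]
    mask = [[0] * w for _ in range(h)]
    noise = [[0.0] * w for _ in range(h)]
    signal, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)

    for y in range(h):
        for x in range(w):
            # Pixels outside any selected block are left untouched.
            if (y // patch, x // patch) in masked_cells:
                eps = rng.gauss(0.0, 1.0)
                noise[y][x] = eps
                mask[y][x] = 1
                noisy[y][x] = signal * image[y][x] + sigma * eps
    return noisy, mask, noise


if __name__ == "__main__":
    img = [[0.5] * 8 for _ in range(8)]
    noisy, mask, _ = masked_diffusion_corrupt(img, 4, 0.5, 0.7, random.Random(0))
    print(sum(map(sum, mask)), "of 64 pixels were noised")
```

In this sketch the visible pixels give the network clean spatial context, and the corrupted blocks define where the reconstruction loss would apply; whether CMD masks blocks, individual pixels, or some other pattern is not stated in the abstract.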

Problem

Research questions and friction points this paper is trying to address.

cell-level dense prediction

computational pathology

domain shift

dense annotation

spatial continuity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked-Diffusion

Convolutional Foundation Model

Cell-Level Dense Prediction