PathVQ: Reforming Computational Pathology Foundation Model for Whole Slide Image Analysis via Vector Quantization

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address spatial detail loss, excessive memory consumption, and high training costs of Vision Transformers (ViTs) in whole-slide image (WSI) analysis due to ultra-high resolution, this paper proposes a vector-quantization-based feature distillation framework. Methodologically, it introduces (1) multi-scale vector quantization (MSVQ), a novel self-supervised pretraining objective that jointly achieves 64× patch-level feature compression and high-fidelity spatial reconstruction; and (2) a progressive convolutional module to strengthen slide-level semantic representation learning. Evaluated on multiple WSI benchmarks, the framework achieves state-of-the-art performance while reducing token storage and training overhead by nearly 200× and significantly improving reconstruction fidelity. This work establishes an efficient, scalable paradigm for high-resolution computational pathology modeling.

📝 Abstract
Computational pathology and whole-slide image (WSI) analysis are pivotal in cancer diagnosis and prognosis. However, the ultra-high resolution of WSIs presents significant modeling challenges. Recent advancements in pathology foundation models have improved performance, yet most approaches rely on the [CLS] token representation of a tile ViT as the slide-level input (a 16x16-pixel region is referred to as a patch and a 224x224-pixel region as a tile). This discards critical spatial details from patch tokens, limiting downstream WSI analysis tasks. We find that leveraging all spatial patch tokens benefits WSI analysis but incurs nearly 200x higher storage and training costs (e.g., 196 tokens in ViT$_{224}$). To address this, we introduce vector quantized (VQ) distillation on patch features, which efficiently compresses spatial patch tokens using discrete indices and a decoder. Our method reduces token dimensionality from 1024 to 16, achieving a 64x compression rate while preserving reconstruction fidelity. Furthermore, we employ a multi-scale VQ (MSVQ) strategy, which not only enhances VQ reconstruction performance but also serves as a self-supervised learning (SSL) supervision signal for a seamless slide-level pretraining objective. Built upon the quantized patch features and tile-level supervision targets from MSVQ, we develop a progressive convolutional module and slide-level SSL to extract representations with rich spatial information for downstream WSI tasks. Extensive evaluations on multiple datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance in WSI analysis. Code will be available soon.
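The core compression idea in the abstract can be sketched as standard vector quantization: project each 1024-d patch token into a 16-d code space (the 64x compression) and store only the index of its nearest codebook entry. The sketch below is a minimal illustration with numpy, not the paper's implementation; the random projection stands in for the learned encoder, and the sizes (196 tokens per tile, 256 codebook entries) are assumptions chosen to mirror the numbers quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes mirroring the paper's setup: 196 spatial patch tokens per
# 224x224 tile, each a 1024-d feature from a tile ViT.
num_tokens, feat_dim = 196, 1024
code_dim, codebook_size = 16, 256  # 1024 -> 16 gives the 64x compression

tokens = rng.standard_normal((num_tokens, feat_dim)).astype(np.float32)

# Stand-in for the learned encoder: a fixed random projection into
# the low-dimensional code space (hypothetical, for illustration only).
W_enc = rng.standard_normal((feat_dim, code_dim)).astype(np.float32) / np.sqrt(feat_dim)
codebook = rng.standard_normal((codebook_size, code_dim)).astype(np.float32)

def quantize(x):
    """Map each token to the index of its nearest codebook entry."""
    z = x @ W_enc                                              # (N, 16) continuous codes
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    return d.argmin(axis=1)                                    # (N,) discrete indices

indices = quantize(tokens)       # what gets stored per patch token
recon_codes = codebook[indices]  # decoder input: (196, 16) quantized features
```

After quantization, each patch token is stored as a single small integer index rather than a 1024-d float vector; a decoder (omitted here) reconstructs the spatial features from `recon_codes` when they are needed downstream.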
Problem

Research questions and friction points this paper is trying to address.

Addresses high storage and training costs in WSI analysis.
Improves spatial detail retention in pathology foundation models.
Enhances WSI analysis with vector quantization and multi-scale strategies.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vector quantized distillation compresses patch tokens.
Multi-scale VQ enhances reconstruction and SSL.
Progressive convolutional module extracts rich spatial information.
Honglin Li
Westlake University
Computer Vision, Multimodal LLM, Biomedical Image Analysis
Zhongyi Shui
Ph.D. Candidate, Westlake University & Zhejiang University
Yunlong Zhang
Zhejiang University, School of Engineering, Westlake University
Chenglu Zhu
Research Center for Industries of the Future, School of Engineering, Westlake University
Lin Yang
Research Center for Industries of the Future, School of Engineering, Westlake University