Revisiting Automatic Data Curation for Vision Foundation Models in Digital Pathology

📅 2025-03-24
🤖 AI Summary
Problem: Pretraining of vision foundation models for digital pathology currently relies on expert-driven data curation at the whole-slide image (WSI) level, which overlooks fine-grained tissue heterogeneity visible only at the tile level. Method: The paper proposes unsupervised automatic data curation at the tile level. Hierarchical clustering is applied to pre-extracted tile embeddings so that balanced datasets can be sampled uniformly across the embedding space. The resulting datasets are subject to a trade-off between size and balance, which the authors address with tailored batch sampling strategies for foundation model pretraining. Contribution/Results: The method improves performance across multiple clinically relevant downstream tasks (including tumor classification, subtype prediction, and survival prognosis), indicating better pathological representations and reduced reliance on manual curation.

📝 Abstract
Vision foundation models (FMs) are accelerating the development of digital pathology algorithms and transforming biomedical research. These models learn, in a self-supervised manner, to represent histological features in highly heterogeneous tiles extracted from whole-slide images (WSIs) of real-world patient samples. The performance of these FMs is significantly influenced by the size, diversity, and balance of the pre-training data. However, data selection has been primarily guided by expert knowledge at the WSI level, focusing on factors such as disease classification and tissue types, while largely overlooking the granular details available at the tile level. In this paper, we investigate the potential of unsupervised automatic data curation at the tile level, considering 350 million tiles. Specifically, we apply hierarchical clustering trees to pre-extracted tile embeddings, allowing us to sample balanced datasets uniformly across the embedding space of the pretrained FM. We further find that these datasets are subject to a trade-off between size and balance, potentially compromising the quality of representations learned by FMs, and propose tailored batch sampling strategies to mitigate this effect. We demonstrate the effectiveness of our method through improved performance on a diverse range of clinically relevant downstream tasks.
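
The idea of sampling a balanced dataset from a clustering tree over tile embeddings can be illustrated with a small sketch. This is not the authors' implementation: the two-level k-means tree, the 2-D toy embeddings, the per-leaf quota, and all function names (`kmeans`, `build_tree`, `balanced_sample`, `toy_embeddings`) are assumptions made for illustration only.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of tuples; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; leave it alone if empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def build_tree(embeddings, k_top=3, k_sub=2, seed=0):
    """Two-level tree (illustrative depth): top cluster -> list of leaves of tile indices."""
    top = kmeans(embeddings, k_top, seed=seed)
    tree = {}
    for c in set(top):
        idx = [i for i, lab in enumerate(top) if lab == c]
        sub = kmeans([embeddings[i] for i in idx], min(k_sub, len(idx)), seed=seed)
        leaves = {}
        for i, s in zip(idx, sub):
            leaves.setdefault(s, []).append(i)
        tree[c] = list(leaves.values())
    return tree

def balanced_sample(tree, per_leaf, seed=0):
    """Draw at most `per_leaf` tiles from every leaf, without replacement."""
    rng = random.Random(seed)
    chosen = []
    for leaves in tree.values():
        for leaf in leaves:
            chosen.extend(rng.sample(leaf, min(per_leaf, len(leaf))))
    return chosen

def toy_embeddings(seed=0):
    """Hypothetical stand-in for tile embeddings: three blobs of very unequal size."""
    rng = random.Random(seed)
    blobs = [((0.0, 0.0), 300), ((10.0, 0.0), 30), ((0.0, 10.0), 10)]
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for (cx, cy), n in blobs for _ in range(n)]

embeddings = toy_embeddings()
tree = build_tree(embeddings)
curated = balanced_sample(tree, per_leaf=3)
```

In this toy setting the quota makes one way a size and balance trade-off can arise visible: a small `per_leaf` keeps every leaf equally represented but yields a small curated set, while a large `per_leaf` admits more tiles than small leaves can supply, so large leaves dominate again.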
Problem

Research questions and friction points this paper is trying to address.

Improving vision foundation models via tile-level data curation
Balancing data size and diversity in pathology image analysis
Enhancing FM performance with unsupervised clustering techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised automatic tile-level data curation
Hierarchical clustering for balanced dataset sampling
Tailored batch sampling to optimize representation quality
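
As a rough illustration of what batch-level sampling over clusters could look like, the sketch below draws a cluster first and a tile second, so every batch is balanced in expectation even when the underlying dataset is not. The summary does not describe the paper's actual batch sampling strategies, so this scheme, the function name `balanced_batches`, and the toy cluster sizes are all assumptions.

```python
import random

def balanced_batches(cluster_of, batch_size, n_batches, seed=0):
    """Yield batches of tile indices, picking a cluster uniformly and then a tile inside it.

    `cluster_of[i]` is the (hypothetical) cluster id of tile i. Sampling is with
    replacement, so tiles from small clusters are revisited more often.
    """
    rng = random.Random(seed)
    members = {}
    for idx, c in enumerate(cluster_of):
        members.setdefault(c, []).append(idx)
    clusters = sorted(members)
    for _ in range(n_batches):
        yield [rng.choice(members[rng.choice(clusters)]) for _ in range(batch_size)]

# Toy assignment: three clusters holding 900, 90, and 10 tiles.
cluster_of = [0] * 900 + [1] * 90 + [2] * 10
batches = list(balanced_batches(cluster_of, batch_size=32, n_batches=50))
```

The design choice here is to keep the full dataset and rebalance at draw time instead of discarding tiles up front; the cost is that tiles from rare clusters are repeated far more often than tiles from common ones.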
Boqi Chen
Dept. of Computer Science, ETH Zurich, Zurich, Switzerland; AI Center, ETH Zurich, Zurich, Switzerland
Cédric Vincent-Cuaz
Signal Processing Laboratory (LTS4), EPFL, Lausanne, Switzerland
Lydia A. Schoenpflug
University of Basel, Basel, Switzerland
Manuel Madeira
EPFL
Deep Learning, Graph Generation
Lisa Fournier
Idiap Research Institute
RNA biology, histopathology, oncology, computational biology
Vaishnavi Subramanian
Post-doctoral researcher at École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Graph neural networks, Digital pathology, Machine learning for medicine, Medical data analysis
Sonali Andani
ETH Zurich
Machine Learning, Cancer Research, Computational Pathology, Computational biology, Deep Learning
Samuel Ruiperez-Campillo
Dept. of Computer Science, ETH Zurich, Zurich, Switzerland
Julia E. Vogt
Dept. of Computer Science, ETH Zurich, Zurich, Switzerland; AI Center, ETH Zurich, Zurich, Switzerland
Raphaëlle Luisier
Bern University, SIB Swiss Institute of Bioinformatics
RNA biology, Artificial Intelligence, Data science, Neuroscience, Oncology
Dorina Thanou
EPFL
machine learning, signal processing, data science, AI for health
Viktor H. Koelzer
University of Basel, Basel, Switzerland; University Hospital of Basel, Basel, Switzerland
Pascal Frossard
Ecole Polytechnique Fédérale de Lausanne, EPFL
machine learning, image processing, computer vision, multimedia communications
Gabriele Campanella
Icahn School of Medicine at Mount Sinai
Deep Learning, Computer Vision, Computational Pathology
Gunnar Rätsch
Professor, ETH Zürich
Machine Learning, Computational Biology, Genetics, Cancer, Health Informatics