🤖 AI Summary
To address a key limitation of whole-slide image (WSI) analysis, namely that tile representations lack global context and are hard to optimize jointly for local- and slide-level tasks, this paper proposes TICON, a Transformer-based universal tile contextualizer. Its core innovation is the first slide-level masked tile modeling pretraining paradigm, which provides encoder-agnostic, unified enhancement of the outputs of any tile encoder via cross-model embedding alignment and a lightweight slide-level aggregator. Pretrained on only 11K WSIs, TICON surpasses state-of-the-art (SOTA) slide-level aggregators trained on 350K WSIs. It establishes new SOTA performance on the tile-level benchmarks HEST-Bench, THUNDER, and CATCH and on the slide-level benchmark Patho-Bench, significantly improving downstream tasks including classification, segmentation, and survival prediction.
📝 Abstract
The interpretation of small tiles in large whole-slide images (WSIs) often requires broader image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for "any" application in computational pathology. Standard tile-encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile encoders excel at different downstream tasks, so a unified model is needed to contextualize embeddings derived from "any" tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model using only 11K WSIs, which outperforms SoTA slide-level foundation models pretrained with up to 350K WSIs.
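
To make the masked modeling idea concrete, below is a minimal, hypothetical sketch of slide-level masked tile modeling over pre-extracted tile embeddings. All module names, dimensions, the masking ratio, and the reconstruction loss are illustrative assumptions for exposition, not the paper's actual architecture or training configuration.

```python
# Hypothetical sketch: contextualize pre-extracted tile embeddings with a
# transformer and train it by reconstructing randomly masked tiles.
# Dimensions, masking ratio, and loss are assumptions, not TICON's real config.
import torch
import torch.nn as nn

class MaskedTileContextualizer(nn.Module):
    def __init__(self, tile_dim=1024, model_dim=512, depth=4, heads=8, mask_ratio=0.5):
        super().__init__()
        # Project any tile encoder's output into a shared embedding space
        self.proj = nn.Linear(tile_dim, model_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, model_dim))
        layer = nn.TransformerEncoderLayer(model_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Reconstruct the original tile embedding from the contextualized one
        self.decoder = nn.Linear(model_dim, tile_dim)
        self.mask_ratio = mask_ratio

    def forward(self, tiles):
        # tiles: (batch, n_tiles, tile_dim) embeddings from a frozen tile encoder
        x = self.proj(tiles)
        b, n, d = x.shape
        # Randomly mask a subset of tiles and replace them with a learned token
        mask = torch.rand(b, n, device=x.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand(b, n, d), x)
        # Self-attention over the whole slide provides global context per tile
        ctx = self.encoder(x)
        recon = self.decoder(ctx)
        # Reconstruction loss only on masked positions
        loss = ((recon - tiles) ** 2)[mask].mean()
        return ctx, loss
```

In this sketch, the contextualized outputs `ctx` would serve tile-level tasks directly, while a separate lightweight aggregator pooled over them would produce a slide-level representation; positional information for tile coordinates is omitted here for brevity.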