🤖 AI Summary
Existing foundation models in computational pathology do not explicitly model the molecular state of tissue. To address this limitation, this work proposes MINT, a framework that leverages spatial transcriptomics (ST) data as a cross-modal supervisory signal while fine-tuning pathology Vision Transformers on histopathology images. MINT introduces a learnable ST token that encodes multi-scale gene expression profiles alongside the morphological CLS token, and it mitigates catastrophic forgetting through DINO-based self-distillation and explicit feature anchoring to the frozen pretrained encoder. Trained on 577 HEST samples, MINT achieves a mean Pearson correlation of 0.440 for gene expression prediction on HEST-Bench and a score of 0.803 on the EVA general pathology benchmark, outperforming current state-of-the-art methods.
📝 Abstract
Pathology foundation models learn morphological representations through self-supervised pretraining on large-scale whole-slide images, yet they do not explicitly capture the underlying molecular state of the tissue. Spatial transcriptomics (ST) technologies bridge this gap by measuring gene expression in situ, offering a natural cross-modal supervisory signal. We propose MINT (Molecularly Informed Training), a fine-tuning framework that incorporates spatial transcriptomics supervision into pretrained pathology Vision Transformers. MINT appends a learnable ST token to the ViT input to encode transcriptomic information separately from the morphological CLS token; catastrophic forgetting is prevented through DINO self-distillation and explicit feature anchoring to the frozen pretrained encoder. Gene expression regression at both spot-level (Visium) and patch-level (Xenium) resolutions provides complementary supervision across spatial scales. Trained on 577 publicly available HEST samples, MINT achieves the best overall performance on both HEST-Bench for gene expression prediction (mean Pearson r = 0.440) and EVA for general pathology tasks (0.803), demonstrating that spatial transcriptomics supervision complements morphology-centric self-supervised pretraining.
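The three training signals named in the abstract (DINO self-distillation, feature anchoring to the frozen encoder, and gene expression regression from the ST token) can be read as one weighted objective. The following is a minimal pure-Python sketch of that combination on toy vectors; the helper functions, the weight names `w_anchor` and `w_reg`, and the temperature values are illustrative assumptions, not the paper's actual implementation.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def softmax(v, temp=1.0):
    """Temperature-scaled softmax; a lower temp gives a sharper distribution."""
    exps = [math.exp(x / temp) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """H(p, q), clamping q away from zero for numerical safety."""
    return -sum(pi * math.log(max(qi, 1e-12)) for pi, qi in zip(p, q))

def mint_loss(student_logits, teacher_logits,
              cls_student, cls_frozen,
              expr_pred, expr_target,
              w_anchor=1.0, w_reg=1.0):
    # DINO-style self-distillation: a sharper teacher distribution
    # supervises the student's softened distribution.
    l_dino = cross_entropy(softmax(teacher_logits, temp=0.04),
                           softmax(student_logits, temp=0.1))
    # Feature anchoring: keep the fine-tuned CLS feature close to the
    # frozen pretrained encoder's CLS feature (mitigates forgetting).
    l_anchor = mse(cls_student, cls_frozen)
    # Gene expression regression from the ST-token prediction head.
    l_reg = mse(expr_pred, expr_target)
    return l_dino + w_anchor * l_anchor + w_reg * l_reg

# Toy example: all inputs are made-up vectors.
loss = mint_loss(
    student_logits=[0.2, 0.5, 0.3],
    teacher_logits=[0.1, 0.7, 0.2],
    cls_student=[1.0, 0.9], cls_frozen=[1.0, 1.0],
    expr_pred=[0.4, 0.6, 0.5], expr_target=[0.5, 0.5, 0.5],
)
```

In a real setup each term would operate on batched ViT outputs, with the anchoring term comparing against a frozen copy of the pretrained encoder run on the same input.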