🤖 AI Summary
Existing foundation models in computational pathology do not explicitly model the molecular state of tissue. To address this limitation, this work proposes MINT, a framework that leverages spatial transcriptomics (ST) data as a cross-modal supervisory signal while fine-tuning pathology Vision Transformers on histopathology images. MINT introduces a learnable ST token that encodes multi-scale gene expression profiles alongside the morphological CLS token, and it mitigates catastrophic forgetting through DINO-based self-distillation and explicit feature anchoring to the frozen pretrained encoder. Trained on 577 HEST samples, MINT achieves a mean Pearson correlation of 0.440 for gene expression prediction on HEST-Bench and a score of 0.803 on the EVA general pathology benchmark, outperforming current state-of-the-art methods.
📝 Abstract
Pathology foundation models learn morphological representations through self-supervised pretraining on large-scale whole-slide images, yet they do not explicitly capture the underlying molecular state of the tissue. Spatial transcriptomics (ST) technologies bridge this gap by measuring gene expression in situ, offering a natural cross-modal supervisory signal. We propose MINT (Molecularly Informed Training), a fine-tuning framework that incorporates spatial transcriptomics supervision into pretrained pathology Vision Transformers. MINT appends a learnable ST token to the ViT input to encode transcriptomic information separately from the morphological CLS token; catastrophic forgetting is prevented through DINO self-distillation and explicit feature anchoring to the frozen pretrained encoder. Gene expression regression at both spot-level (Visium) and patch-level (Xenium) resolutions provides complementary supervision across spatial scales. Trained on 577 publicly available HEST samples, MINT achieves the best overall performance on both HEST-Bench for gene expression prediction (mean Pearson r = 0.440) and EVA for general pathology tasks (0.803), demonstrating that spatial transcriptomics supervision complements morphology-centric self-supervised pretraining.
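The three training signals named in the abstract (DINO self-distillation, feature anchoring to the frozen encoder, and gene expression regression from the ST token) can be read as one weighted objective. The following is a minimal pure-Python sketch of that combination on toy vectors; the helper functions, the weight names `w_anchor` and `w_reg`, and the temperature values are illustrative assumptions, not the paper's actual implementation.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def softmax(v, temp=1.0):
    """Temperature-scaled softmax; a lower temp gives a sharper distribution."""
    exps = [math.exp(x / temp) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """H(p, q), clamping q away from zero for numerical safety."""
    return -sum(pi * math.log(max(qi, 1e-12)) for pi, qi in zip(p, q))

def mint_loss(student_logits, teacher_logits,
              cls_student, cls_frozen,
              expr_pred, expr_target,
              w_anchor=1.0, w_reg=1.0):
    # DINO-style self-distillation: a sharper teacher distribution
    # supervises the student's softened distribution.
    l_dino = cross_entropy(softmax(teacher_logits, temp=0.04),
                           softmax(student_logits, temp=0.1))
    # Feature anchoring: keep the fine-tuned CLS feature close to the
    # frozen pretrained encoder's CLS feature (mitigates forgetting).
    l_anchor = mse(cls_student, cls_frozen)
    # Gene expression regression from the ST-token prediction head.
    l_reg = mse(expr_pred, expr_target)
    return l_dino + w_anchor * l_anchor + w_reg * l_reg

# Toy example: all inputs are made-up vectors.
loss = mint_loss(
    student_logits=[0.2, 0.5, 0.3],
    teacher_logits=[0.1, 0.7, 0.2],
    cls_student=[1.0, 0.9], cls_frozen=[1.0, 1.0],
    expr_pred=[0.4, 0.6, 0.5], expr_target=[0.5, 0.5, 0.5],
)
```

In a real setup each term would operate on batched ViT outputs, with the anchoring term comparing against a frozen copy of the pretrained encoder run on the same input.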