A Large-Scale Benchmark of Cross-Modal Learning for Histology and Gene Expression in Spatial Transcriptomics

📅 2025-08-02
🤖 AI Summary
Problem: Spatial transcriptomics lacks a systematic multimodal benchmark for jointly modeling histology images and spatial gene expression data. Method: We introduce HESCAPE, a large-scale cross-modal benchmark built on paired spatial omics data from 54 donors across six gene panels, and use it to evaluate cross-modal contrastive pretraining with state-of-the-art image and gene expression encoders. Contribution/Results: Our analysis shows that the gene encoder is the main determinant of cross-modal representation alignment; that gene models pretrained on spatial transcriptomics data outperform both non-spatially pretrained counterparts and simple baselines; that batch effects interfere with alignment; and that contrastive pretraining improves gene mutation classification but degrades gene expression prediction relative to baseline encoders, so the benefit depends on the downstream task. HESCAPE provides standardized datasets, evaluation protocols, and benchmarking tools for multimodal spatial omics.

📝 Abstract
Spatial transcriptomics enables simultaneous measurement of gene expression and tissue morphology, offering unprecedented insights into cellular organization and disease mechanisms. However, the field lacks comprehensive benchmarks for evaluating multimodal learning methods that leverage both histology images and gene expression data. Here, we present HESCAPE, a large-scale benchmark for cross-modal contrastive pretraining in spatial transcriptomics, built on a curated pan-organ dataset spanning 6 different gene panels and 54 donors. We systematically evaluated state-of-the-art image and gene expression encoders across multiple pretraining strategies and assessed their effectiveness on two downstream tasks: gene mutation classification and gene expression prediction. Our benchmark demonstrates that gene expression encoders are the primary determinant of strong representational alignment, and that gene models pretrained on spatial transcriptomics data outperform both those trained without spatial data and simple baseline approaches. However, downstream task evaluation reveals a striking contradiction: while contrastive pretraining consistently improves gene mutation classification performance, it degrades direct gene expression prediction compared to baseline encoders trained without cross-modal objectives. We identify batch effects as a key factor that interferes with effective cross-modal alignment. Our findings highlight the critical need for batch-robust multimodal learning approaches in spatial transcriptomics. To accelerate progress in this direction, we release HESCAPE, providing standardized datasets, evaluation protocols, and benchmarking tools for the community.
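
The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice, a CLIP-style symmetric InfoNCE loss over paired image and gene expression embeddings. The function names, the toy embeddings, and the temperature of 0.07 are illustrative assumptions, not details taken from HESCAPE.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(image_emb, gene_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is the same tissue spot.

    NOTE: hypothetical sketch. The temperature value and the symmetric
    averaging are assumptions borrowed from CLIP-style training.
    """
    assert len(image_emb) == len(gene_emb)
    img = [l2_normalize(v) for v in image_emb]
    gen = [l2_normalize(v) for v in gene_emb]
    n = len(img)

    # Cosine similarity between every image and every gene embedding.
    logits = [
        [sum(a * b for a, b in zip(img[i], gen[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-gene direction: the matching gene profile is the target.
    loss_i2g = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Gene-to-image direction: same targets on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_g2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n

    return 0.5 * (loss_i2g + loss_g2i)


# Toy check: correctly paired embeddings give a lower loss than swapped ones.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In practice the embeddings would come from trainable image and gene encoders, and the loss would be minimized over mini-batches of paired spots.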
Problem

Research questions and friction points this paper is trying to address.

Spatial transcriptomics lacks benchmarks for multimodal learning
Evaluates cross-modal pretraining for gene expression and histology image data
Identifies batch effects that hinder cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal contrastive pretraining for spatial transcriptomics
Benchmarking gene expression and histology image encoders
Evidence that batch-robust multimodal learning approaches are needed
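
The page does not say how representational alignment between encoders is scored. As a hedged illustration, one measure often used for contrastively trained models is cross-modal retrieval Recall@k: for each image embedding, check whether its paired gene embedding is among the k most similar. The function names and toy data below are assumptions made for the example and are not taken from the benchmark.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(image_emb, gene_emb, k=1):
    """Fraction of images whose paired gene embedding ranks in the top k.

    NOTE: hypothetical metric sketch; row i of each input is assumed to
    describe the same tissue spot.
    """
    n = len(image_emb)
    hits = 0
    for i, img in enumerate(image_emb):
        sims = [cosine(img, g) for g in gene_emb]
        # Indices of gene embeddings sorted from most to least similar.
        ranked = sorted(range(n), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / n


# Toy data: three spots with two-dimensional embeddings per modality.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
genes_paired = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.5]]
genes_swapped = [[0.1, 0.9], [0.9, 0.1], [0.6, 0.5]]  # first two pairs swapped

r1_paired = recall_at_k(images, genes_paired, k=1)
r1_swapped = recall_at_k(images, genes_swapped, k=1)
print(r1_paired, r1_swapped)
```

A higher Recall@k for one encoder pair than another would suggest better alignment under this particular measure; batch effects could inflate or deflate it if spots from the same batch cluster together.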
Authors

Rushin H. Gindra
Computational Health Center, Helmholtz Munich, Munich, Germany

Giovanni Palla
Chan Zuckerberg Initiative
Research interests: computational biology, machine learning

Mathias Nguyen
Computational Health Center, Helmholtz Munich, Munich, Germany

Sophia J. Wagner
Technical University Munich, Helmholtz AI
Research interests: computational pathology, deep learning, computer vision

Manuel Tran
Computational Health Center, Helmholtz Munich, Munich, Germany; School of Computation, Information and Technology, Technical University Munich, Munich, Germany

Fabian J Theis
Helmholtz Munich, Technical University of Munich
Research interests: computational biology, machine learning

Dieter Saur
School of Medicine and Health, Technical University Munich, Munich, Germany

Lorin Crawford
Microsoft Research
Research interests: statistics, machine learning, genetics, cancer genomics, topological data analysis

Tingying Peng
Group leader at Helmholtz AI, Helmholtz Zentrum München
Research interests: biomedical image processing, supervised learning