🤖 AI Summary
This study addresses alignment failure in cross-modal registration of spatial transcriptomics (ST) data and histopathological images, caused by spatial distortions, modality heterogeneity, and the high-dimensional sparsity of gene expression data. We propose a ranking-consistency-based gene–image representation learning framework with two key contributions: (1) a multi-scale ranking alignment loss that enables robust and interpretable cross-modal geometric matching; and (2) a self-supervised teacher–student distillation architecture, in which the teacher network provides low-noise target representations that suppress the noise inherent in gene expression measurements. Evaluated on seven public ST datasets, our method significantly improves performance on downstream tasks, including gene expression prediction, tissue section classification, and survival analysis, while achieving superior alignment accuracy and outperforming state-of-the-art methods on all evaluated metrics.
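The summary does not spell out the form of the ranking alignment loss. As a rough single-scale illustration, one common listwise relaxation of ranking consistency matches the row-wise similarity distributions of the two modalities; everything below (function name, the KL-based formulation, the temperature `tau`) is an assumption for illustration, not the paper's actual loss:

```python
import torch
import torch.nn.functional as F

def ranking_alignment_loss(gene_emb, img_emb, tau=0.1):
    """Hypothetical sketch: encourage the ranking of pairwise spot
    similarities to agree across modalities by matching row-wise
    softmax distributions (a listwise relaxation of rank consistency).

    gene_emb: (n, d_g) gene-expression embeddings for n spots
    img_emb:  (n, d_v) image-patch embeddings for the same n spots
    """
    g = F.normalize(gene_emb, dim=-1)
    v = F.normalize(img_emb, dim=-1)
    sim_g = g @ g.t() / tau  # gene-gene similarities, (n, n)
    sim_v = v @ v.t() / tau  # image-image similarities, (n, n)

    # Mask self-similarity so each row ranks only the *other* spots.
    n = g.size(0)
    mask = torch.eye(n, dtype=torch.bool, device=g.device)
    sim_g = sim_g.masked_fill(mask, -1e9)
    sim_v = sim_v.masked_fill(mask, -1e9)

    p_g = F.softmax(sim_g, dim=-1)           # target ranking distribution
    log_p_v = F.log_softmax(sim_v, dim=-1)   # predicted (log) distribution
    # KL divergence between the two ranking distributions; zero iff
    # both modalities induce identical similarity orderings.
    return F.kl_div(log_p_v, p_g, reduction='batchmean')
```

The multi-scale version in the paper would presumably apply a loss of this kind at several spatial neighborhood sizes and sum the terms; that aggregation is omitted here.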
📝 Abstract
Spatial transcriptomics (ST) provides essential spatial context by mapping gene expression within tissue, enabling detailed study of cellular heterogeneity and tissue organization. However, aligning ST data with histology images poses challenges due to inherent spatial distortions and modality-specific variations. Existing methods largely rely on direct alignment, which often fails to capture complex cross-modal relationships. To address these limitations, we propose a novel framework that aligns gene and image features using a ranking-based alignment loss, preserving relative similarity across modalities and enabling robust multi-scale alignment. To further stabilize the alignment, we employ self-supervised knowledge distillation with a teacher–student network architecture, effectively mitigating the disruptions caused by the high dimensionality, sparsity, and noise of gene expression data. Extensive experiments on seven public datasets, spanning gene expression prediction, slide-level classification, and survival analysis, demonstrate the efficacy of our method, showing improved alignment and predictive performance over existing methods.
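The abstract does not detail the teacher–student mechanism. One common self-supervised distillation pattern, and a plausible reading of the teacher's "low-noise representations", is an exponential-moving-average (EMA) teacher whose weights are a smoothed copy of the student's; the class below is a sketch under that assumption, not the paper's stated design:

```python
import copy
import torch

class EMATeacher:
    """Hypothetical sketch of self-supervised teacher-student
    distillation: the teacher is an exponential-moving-average (EMA)
    copy of the student, so it changes slowly and provides smoother,
    lower-noise target representations for the distillation loss."""

    def __init__(self, student, momentum=0.999):
        self.momentum = momentum
        self.teacher = copy.deepcopy(student)
        # The teacher is never trained by gradient descent.
        for p in self.teacher.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, student):
        # teacher <- m * teacher + (1 - m) * student, applied per-parameter
        for pt, ps in zip(self.teacher.parameters(), student.parameters()):
            pt.mul_(self.momentum).add_(ps, alpha=1 - self.momentum)
```

In a typical training loop the student is updated by backpropagation on the alignment and distillation losses, and `update(student)` is called once per step so the teacher trails the student with heavy smoothing.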