🤖 AI Summary
This work proposes a multitask, multimodal supervised framework to address the challenge of integrating heterogeneous data such as whole-slide images and clinical records. Built upon a linear-complexity multiple instance learning (MIL) backbone, the method leverages graph neural networks to extract histopathological features, standardizes clinical data into unified embeddings, and explicitly decomposes shared and modality-specific representations to enable effective cross-modal alignment and fusion. Notably, it introduces the Mamba architecture into multimodal pathological analysis for the first time, constructing an efficient Mamba-based MIL encoder. Evaluated on CAMELYON16 and TCGA-NSCLC, the approach improves classification accuracy by 2.1–6.6% and AUC by 2.2–6.9%. Across five TCGA survival cohorts, it achieves significantly higher concordance indices (C-index), outperforming unimodal and other multimodal methods by 7.1–9.8% and 5.6–7.1%, respectively.
📝 Abstract
Multimodal evidence is critical in computational pathology: gigapixel whole-slide images capture tumor morphology, while patient-level clinical descriptors preserve complementary context for prognosis. Integrating such heterogeneous signals remains challenging because feature spaces exhibit distinct statistics and scales. We introduce MMSF, a multitask and multimodal supervised framework built on a linear-complexity MIL backbone that explicitly decomposes and fuses cross-modal information. MMSF comprises a graph feature extraction module embedding tissue topology at the patch level, a clinical data embedding module standardizing patient attributes, a feature fusion module aligning modality-shared and modality-specific representations, and a Mamba-based MIL encoder with multitask prediction heads. Experiments on CAMELYON16 and TCGA-NSCLC demonstrate 2.1–6.6% accuracy and 2.2–6.9% AUC improvements over competitive baselines, while evaluations on five TCGA survival cohorts yield 7.1–9.8% C-index improvements compared with unimodal methods and 5.6–7.1% over multimodal alternatives.
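The abstract's central fusion idea, splitting each modality's embedding into a shared component (aligned across modalities) and a specific component (kept private), can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the projection heads, dimensions, and averaging-based alignment below are all assumptions for demonstration.

```python
# Toy sketch of shared/specific decomposition and fusion across two
# modalities (WSI features and clinical embeddings). All names, sizes,
# and the averaging alignment are hypothetical, not from the paper.
import random

random.seed(0)
DIM = 8  # illustrative embedding size per modality


def project(vec, weights):
    """Apply a DIM x DIM linear projection to a DIM-vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def rand_matrix(n):
    return [[random.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(n)]


# One projection pair per modality: a "shared" head whose outputs are
# meant to land in a common space, and a "specific" head that retains
# modality-private information.
heads = {
    "wsi":      {"shared": rand_matrix(DIM), "specific": rand_matrix(DIM)},
    "clinical": {"shared": rand_matrix(DIM), "specific": rand_matrix(DIM)},
}


def decompose(name, embedding):
    h = heads[name]
    return project(embedding, h["shared"]), project(embedding, h["specific"])


def fuse(wsi_emb, clin_emb):
    w_shared, w_spec = decompose("wsi", wsi_emb)
    c_shared, c_spec = decompose("clinical", clin_emb)
    # Shared parts are merged (here: averaged) after alignment;
    # specific parts are concatenated so neither modality's private
    # signal is discarded.
    shared = [(a + b) / 2 for a, b in zip(w_shared, c_shared)]
    return shared + w_spec + c_spec  # length 3 * DIM


wsi_emb = [random.gauss(0, 1) for _ in range(DIM)]
clin_emb = [random.gauss(0, 1) for _ in range(DIM)]
fused = fuse(wsi_emb, clin_emb)
print(len(fused))  # 3 * DIM = 24
```

In the full method the fused representation would feed the Mamba-based MIL encoder and its multitask heads; in practice the alignment of shared components is typically enforced by a training loss rather than simple averaging.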