A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model

📅 2024-07-22
🏛️ arXiv.org
📈 Citations: 9
Influential: 1
🤖 AI Summary
Existing pathological foundation models are constrained by unimodal (image-only or image-text pair) architectures and patch-level modeling, limiting their ability to integrate clinical text with molecular omics data and lacking whole-slide contextual understanding. This work introduces the first multimodal foundation model tailored for whole-slide pathology, jointly modeling H&E whole-slide images, structured pathology reports, and RNA-Seq data. We propose a novel whole-slide multimodal self-teaching pretraining paradigm (mSTAR), incorporating a multimodal alignment encoder, cross-modal contrastive learning, and whole-slide contextual aggregation. Our model achieves the first joint representation learning of clinical text and molecular phenotypes at the whole-slide granularity. Evaluated across 43 subtasks spanning seven downstream application domains, it consistently outperforms state-of-the-art methods, delivering significant performance gains—particularly in core clinical tasks including diagnosis, subtyping, and prognosis—where whole-slide–level accuracy is critical.

📝 Abstract
Remarkable strides in computational pathology (CPath) have been made with task-agnostic foundation models (FMs) that advance the performance of a wide array of downstream clinical tasks. Despite the promising performance, several challenges remain. First, prior works have resorted to either vision-only or vision-caption data, disregarding invaluable pathology reports and gene expression profiles, which respectively offer distinct knowledge for versatile clinical applications. Second, current progress in pathology FMs predominantly concentrates on the patch level, where the restricted context of patch-level pretraining fails to capture whole-slide patterns. Here we curated the largest multimodal dataset, consisting of H&E diagnostic whole-slide images with their associated pathology reports and RNA-Seq data, resulting in 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types. To leverage these data for CPath, we propose a novel whole-slide pretraining paradigm, Multimodal Self-TAught PRetraining (mSTAR), which injects multimodal knowledge at the whole-slide context into the pathology FM. The proposed paradigm revolutionizes the pretraining workflow for CPath, enabling the pathology FM to acquire whole-slide context. To our knowledge, this is the first attempt to incorporate multimodal knowledge at the slide level for enhancing pathology FMs, expanding the modelling context from unimodal to multimodal knowledge and from patch level to slide level. To systematically evaluate the capabilities of mSTAR, extensive experiments, including slide-level unimodal and multimodal applications, are conducted across 7 diverse types of tasks on 43 subtasks, resulting in the largest spectrum of downstream tasks. Across various slide-level applications, mSTAR consistently delivers significant average performance gains over SOTA FMs.
Problem

Research questions and friction points this paper is trying to address.

Prior pathology FMs rely on vision-only or vision-caption data, ignoring pathology reports and gene expression profiles
Patch-level pretraining offers restricted context and fails to capture whole-slide patterns
Multimodal clinical knowledge has not been injected into pathology FMs at the slide level
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal model integrates slides, reports, gene data
Whole-slide pretraining captures comprehensive pathology patterns
mSTAR method enhances patch representation with slide context
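The paper describes mSTAR only at a high level (a multimodal alignment encoder, cross-modal contrastive learning, and whole-slide contextual aggregation). As an illustrative sketch, not the authors' implementation, the core idea of aggregating patch embeddings into a slide-level vector and contrastively aligning it with a report embedding can be shown with a CLIP-style symmetric InfoNCE loss. All function names and dimensions here are hypothetical; mean pooling stands in for whatever aggregator mSTAR actually uses.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize embeddings to unit length for cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def aggregate_slide(patch_embs):
    """Placeholder whole-slide contextual aggregation: mean-pool the
    patch embeddings of one WSI into a single slide-level vector."""
    return patch_embs.mean(axis=0)

def info_nce(slide_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss aligning slide and report embeddings,
    in the spirit of CLIP-style cross-modal pretraining."""
    s = l2_normalize(slide_embs)
    t = l2_normalize(text_embs)
    logits = s @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(s))              # matched pairs lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy example: 4 slides with varying patch counts, 64-dim embeddings.
rng = np.random.default_rng(0)
patches = [rng.normal(size=(n, 64)) for n in (120, 80, 200, 150)]
slides = np.stack([aggregate_slide(p) for p in patches])
reports = rng.normal(size=(4, 64))          # toy report embeddings
loss = info_nce(slides, reports)
print(f"contrastive loss: {loss:.4f}")
```

With random embeddings the loss sits near log(batch size); training would pull each slide embedding toward its paired report (and, analogously, its RNA-Seq profile) while pushing mismatched pairs apart.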
👥 Authors
Yingxue Xu (The Hong Kong University of Science and Technology)
Yihui Wang (Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China)
Fengtao Zhou (The Hong Kong University of Science and Technology)
Jiabo Ma (Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China)
Shu Yang (Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China)
Huangjing Lin (Imsight Medical Technology, Co., Ltd.)
Xin Wang (Department of Surgery, Prince of Wales Hospital, The Chinese University of Hong Kong, Hong Kong, China)
Jiguang Wang (Department of Chemical and Biological Engineering and Division of Life Science, The Hong Kong University of Science and Technology, Hong Kong SAR, China)
Li Liang (The University of Western Australia)
Anjia Han (Department of Pathology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, China)
Ronald Cheong Kin Chan (Department of Anatomical and Cellular Pathology, Prince of Wales Hospital, The Chinese University of Hong Kong, Hong Kong SAR, China)
Hao Chen (Department of Computer Science and Engineering, Department of Chemical and Biological Engineering, and Division of Life Science, The Hong Kong University of Science and Technology, Hong Kong SAR, China)