AI Summary
This work addresses the challenge of multimodal fusion modeling in computational pathology: integrating H&E whole-slide images, clinical text, knowledge graphs, and molecular features. We systematically survey 32 state-of-the-art multimodal foundation models and propose the first pathology-specific taxonomy of multimodal paradigms: vision-language, vision-knowledge graph, and vision-gene expression, further distinguishing large language model (LLM)-based from non-LLM vision-language architectures. We curate a pathology-oriented multimodal dataset inventory and downstream task classification, consolidating 28 benchmark datasets and key techniques including contrastive learning, cross-modal alignment, instruction tuning, knowledge graph embedding, and multi-omics integration. The resulting technical map spans model architectures, data resources, training strategies, evaluation benchmarks, and open challenges, establishing a comprehensive reference framework for AI-driven pathology research and development.
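Contrastive learning is the core alignment technique behind many of the vision-language models surveyed here (e.g. CLIP-style pretraining on image-text pairs). The sketch below, assuming NumPy and paired tile/report embeddings, illustrates the symmetric InfoNCE objective: matched image-text pairs sit on the diagonal of a similarity matrix and are pulled together, while mismatched pairs are pushed apart. The function name and shapes are illustrative, not from any specific surveyed model.

```python
import numpy as np

def symmetric_info_nce(img_emb, txt_emb, temperature=0.07):
    """CLIP-style contrastive loss over a batch of matched image/text embeddings.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matched pair.
    Returns the mean of the image-to-text and text-to-image cross-entropies.
    """
    # L2-normalize so the dot product is cosine similarity
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; matched pairs lie on the diagonal
    logits = img_emb @ txt_emb.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # numerically stable softmax cross-entropy with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        probs = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(probs[np.arange(n), np.arange(n)]).mean()

    # symmetric: align images to texts and texts to images
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In practice the surveyed models compute this loss over large batches of tile-caption or WSI-report pairs, with learnable temperature and modality-specific encoders; the NumPy version above only shows the objective itself.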
Abstract
Foundation models have emerged as a powerful paradigm in computational pathology (CPath), enabling scalable and generalizable analysis of histopathological images. While early developments centered on uni-modal models trained solely on visual data, recent advances have highlighted the promise of multi-modal foundation models that integrate heterogeneous data sources such as textual reports, structured domain knowledge, and molecular profiles. In this survey, we provide a comprehensive and up-to-date review of multi-modal foundation models in CPath, with a particular focus on models built upon hematoxylin and eosin (H&E)-stained whole slide images (WSIs) and tile-level representations. We categorize 32 state-of-the-art multi-modal foundation models into three major paradigms: vision-language, vision-knowledge graph, and vision-gene expression. We further divide vision-language models into non-LLM-based and LLM-based approaches. Additionally, we analyze 28 available multi-modal datasets tailored for pathology, grouped into image-text pairs, instruction datasets, and image-other modality pairs. Our survey also presents a taxonomy of downstream tasks, highlights training and evaluation strategies, and identifies key challenges and future directions. We aim for this survey to serve as a valuable resource for researchers and practitioners working at the intersection of pathology and AI.