🤖 AI Summary
Current computational pathology methods rely predominantly on task-specific models that require extensive labeled data and generalize poorly, while mainstream CLIP-style models adopt image-text dual-encoder architectures with limited cross-modal modeling capacity; the field also lacks a unified benchmark for evaluating multimodal embeddings. To address these limitations, we propose the first general-purpose multimodal embedding framework for pathology. Our method introduces (1) a novel MLLM-based paradigm for universal pathology embeddings, enabling fine-grained cross-modal alignment and fusion; (2) PMEB, the first dedicated multimodal embedding benchmark for pathology, comprising 15 tasks drawn from 14 datasets and organized into three meta-tasks: retrieval, classification, and composed retrieval; and (3) a comprehensive evaluation demonstrating consistent and significant improvements over CLIP-based and other baselines across all 15 tasks, validating the framework's effectiveness and generalizability as a foundational multimodal representation model for pathology.
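The summary does not spell out how the embeddings are produced; a common recipe for MLLM-based universal embedding models, and a plausible reading of the paradigm described above, is to feed the interleaved image-text input through the MLLM, pool the final token's last-layer hidden state, and train with an in-batch contrastive objective. The sketch below assumes a LLaVA-style model behind a hypothetical `mllm`/`processor` interface; the pooling choice and function names are illustrative, not confirmed details of MLLM4PUE:

```python
import torch
import torch.nn.functional as F

def embed_pair(mllm, processor, image, text):
    """Encode an (image, text) input into one embedding via the MLLM.

    Assumed LLaVA-style interface: `processor` interleaves image patches
    with text tokens, and the model exposes per-token hidden states.
    """
    inputs = processor(text=text, images=image, return_tensors="pt")
    hidden = mllm(**inputs, output_hidden_states=True).hidden_states[-1]
    emb = hidden[:, -1, :]           # last-token pooling
    return F.normalize(emb, dim=-1)  # unit norm, so dot product = cosine

def info_nce(queries, targets, temperature=0.05):
    """Symmetric InfoNCE over in-batch negatives.

    Row i of `queries` should match row i of `targets`; every other
    row in the batch acts as a negative.
    """
    logits = queries @ targets.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))
```

Unlike a dual encoder, the query here can already be a fused image-plus-text input, which is what makes a single model viable for composed retrieval.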
📝 Abstract
Pathology plays a critical role in diagnosing a wide range of diseases, yet existing approaches often rely heavily on task-specific models trained on extensive, well-labeled datasets. These methods face sustainability challenges due to the diversity of pathologies and the labor-intensive nature of data collection. To address these limitations, we highlight the need for universal multimodal embeddings that can support multiple downstream tasks. Previous approaches often involve fine-tuning CLIP-based models, which handle images and text separately, limiting their ability to capture complex multimodal relationships. Additionally, these models are evaluated across diverse datasets without a unified benchmark for assessing multimodal embeddings in pathology. To address these challenges, we propose MLLM4PUE, a novel framework that leverages Multimodal Large Language Models (MLLMs) to generate Pathology Universal Embeddings. MLLM4PUE not only enables robust integration of images and text but also enhances cross-modal understanding and fusion across a variety of tasks. We further introduce the Pathology Multimodal Embedding Benchmark (PMEB), a comprehensive benchmark designed to assess the quality of pathology multimodal embeddings. PMEB comprises 15 original tasks drawn from 14 datasets, organized into three meta-tasks: retrieval, classification, and composed retrieval. Experimental results demonstrate the superiority of MLLM4PUE, illustrating that MLLM-based models can effectively support a wide range of downstream tasks and unify the research direction for foundation models in pathology.
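Because the benchmark evaluates frozen embeddings, all three meta-tasks reduce to the same similarity-search primitive: classification retrieves over class-name (text) embeddings, and composed retrieval uses a fused image-plus-instruction query. PMEB's exact protocol and metrics are not given in the abstract; the snippet below is a generic Recall@k sketch under those assumptions:

```python
import torch

def recall_at_k(query_embs, gallery_embs, gt_index, k=1):
    """Generic Recall@k over precomputed, unit-norm embeddings.

    query_embs:   (Q, d) query embeddings
    gallery_embs: (G, d) candidate embeddings (images, texts, or class names)
    gt_index:     (Q,)   index of each query's correct gallery item
    """
    sims = query_embs @ gallery_embs.T    # cosine similarities
    topk = sims.topk(k, dim=-1).indices   # (Q, k) ranked candidates
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()
```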