WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

📅 2024-12-03

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Current multimodal large language models (MLLMs) in computational pathology operate at the patch level, so they cannot analyze whole slide images (WSIs) comprehensively and miss the global morphological patterns that carry critical diagnostic information. To address this, we propose the first end-to-end framework for fine-grained WSI understanding. Our method comprises: (1) constructing WSI-Bench, a morphology-aware benchmark; (2) designing a three-stage training paradigm consisting of WSI-text alignment, feature-space alignment, and task-specific instruction tuning; (3) introducing a morphology-driven visual encoder and a pathology-knowledge-enhanced visual question answering (VQA) architecture; and (4) defining two pathology-specific evaluation metrics, WSI-Precision and WSI-Relevance. Experiments on WSI-Bench show that the model outperforms existing MLLMs across all capability dimensions, with a significant improvement in morphological analysis. The results also establish a clear correlation between morphological understanding and diagnostic accuracy.

📝 Abstract

Recent advancements in computational pathology have produced patch-level Multi-modal Large Language Models (MLLMs), but these models are limited by their inability to analyze whole slide images (WSIs) comprehensively and their tendency to bypass crucial morphological features that pathologists rely on for diagnosis. To address these challenges, we first introduce WSI-Bench, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, designed to evaluate MLLMs' understanding of morphological characteristics crucial for accurate diagnosis. Building upon this benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI understanding that employs a three-stage training approach: WSI-text alignment, feature space alignment, and task-specific instruction tuning. To better assess model performance in pathological contexts, we develop two specialized WSI metrics: WSI-Precision and WSI-Relevance. Experimental results demonstrate that WSI-LLaVA outperforms existing models across all capability dimensions, with a significant improvement in morphological analysis, establishing a clear correlation between morphological understanding and diagnostic accuracy.

Problem

Research questions and friction points this paper is trying to address.

Improving multimodal models for whole slide image analysis

Addressing limitations in morphological feature recognition

Enhancing diagnostic accuracy through better WSI understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage training for gigapixel WSI analysis

Large-scale morphology-aware benchmark WSI-Bench

Specialized WSI metrics for pathological assessment
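The three-stage ordering named above can be illustrated with a minimal sketch. Only the stage names and the benchmark figures (180k VQA pairs, 9,850 WSIs) come from the abstract; the `Stage` class, the `run_schedule` and `mean_pairs_per_slide` helpers, and the one-line objective strings are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Stage:
    name: str       # stage label taken from the abstract
    objective: str  # illustrative paraphrase (assumption, not the paper's wording)


# The order follows the abstract; everything else here is illustrative.
STAGES = (
    Stage("wsi_text_alignment", "align WSI representations with text"),
    Stage("feature_space_alignment", "align visual features with the language model"),
    Stage("instruction_tuning", "task-specific tuning on VQA-style instructions"),
)


def run_schedule(stages: Sequence[Stage],
                 train_one: Callable[[Stage], None]) -> List[str]:
    """Run the stages strictly in order and return the names of those completed."""
    completed = []
    for stage in stages:
        train_one(stage)  # hypothetical per-stage training routine supplied by the caller
        completed.append(stage.name)
    return completed


def mean_pairs_per_slide(num_pairs: int, num_slides: int) -> float:
    """Average number of VQA pairs per whole slide image."""
    return num_pairs / num_slides


if __name__ == "__main__":
    done = run_schedule(STAGES, lambda stage: print(f"training: {stage.name}"))
    print(done)
    # Benchmark figures quoted in the abstract: 180k pairs from 9,850 WSIs.
    print(round(mean_pairs_per_slide(180_000, 9_850), 1))
```

With the figures quoted in the abstract, the benchmark averages roughly 18 VQA pairs per slide. A real training routine would replace the placeholder callback passed to `run_schedule`.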
