🤖 AI Summary
This work addresses the lack of clinical grounding, standardized data, and diagnostic workflow alignment in existing pathological visual question answering (VQA) methods. To bridge this gap, the authors construct a Clinical Pathology Report Template (CPRT) based on the College of American Pathologists (CAP) cancer protocols, enabling systematic extraction of pathological features. They introduce CTIS-Bench—the first slide-level VQA benchmark clinically aligned with real-world diagnostic practices—and curate the CTIS-Align dataset to facilitate vision–language alignment training. Furthermore, they propose CTIS-QA, a dual-stream architecture that mimics pathologists’ “global screening–local focusing” visual strategy. Experiments demonstrate that CTIS-QA significantly outperforms state-of-the-art methods on WSI-VQA, CTIS-Bench, and multiple downstream diagnostic tasks.
📝 Abstract
Multimodal large language models (MLLMs) have demonstrated strong performance in patch-level pathological image analysis; however, they often lack the holistic perceptual capability necessary for comprehensive Whole Slide Image (WSI) interpretation. Recent approaches have explored constructing slide-level MLLMs using VQA datasets generated entirely from pathology reports by large language models (LLMs). However, these datasets suffer from critical limitations: hallucinated content, information leakage in question stems, clinically irrelevant or visually ungrounded questions, and the omission of essential diagnostic features. These issues undermine both data quality and clinical validity. In this paper, we introduce a clinical diagnosis template-based pipeline for collecting pathological information. In collaboration with pathologists and guided by the College of American Pathologists (CAP) Cancer Protocols, we design a Clinical Pathology Report Template (CPRT) that ensures comprehensive and standardized extraction of diagnostic elements from pathology reports. We validate the effectiveness of our pipeline on TCGA-BRCA. First, we extract pathological features from reports using CPRT. These features are then used to build CTIS-Align, a dataset of 80k slide-description pairs from 804 WSIs for vision-language alignment training, and CTIS-Bench, a rigorously curated VQA benchmark comprising 977 WSIs and 14,879 question-answer pairs. CTIS-Bench emphasizes clinically grounded, closed-ended questions (e.g., tumor grade, receptor status) that reflect real diagnostic workflows, minimize non-visual reasoning, and require genuine slide understanding. We further propose CTIS-QA, a slide-level question answering model featuring a dual-stream architecture that mimics pathologists' diagnostic approach.
One stream captures global slide-level context via clustering-based feature aggregation, while the other focuses on salient local regions through an attention-guided patch perception module. Extensive experiments on WSI-VQA, CTIS-Bench, and slide-level diagnostic tasks show that CTIS-QA consistently outperforms existing state-of-the-art models across multiple metrics. We will fully release both CTIS-Bench and CTIS-QA as open-source resources.
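The dual-stream idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes patch embeddings have already been extracted by a frozen encoder, and the hand-rolled k-means, the dot-product `query` vector, and the top-m softmax pooling are all simplifying assumptions standing in for the actual clustering-based aggregation and attention-guided patch perception modules.

```python
import numpy as np

def kmeans(feats, k, iters=10, seed=0):
    """Toy k-means over patch embeddings (stand-in for the global stream's clustering)."""
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Assign each patch to its nearest centroid, then recompute centroids.
        dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = feats[assign == j].mean(axis=0)
    return centroids

def dual_stream(feats, query, k=4, top_m=8):
    """feats: (num_patches, dim) patch embeddings; query: (dim,) attention query.

    Returns a concatenated [global_token, local_token] slide representation.
    """
    # Global stream: cluster all patches, then mean-pool the cluster centroids
    # to summarize slide-level context.
    global_token = kmeans(feats, k).mean(axis=0)

    # Local stream: score patches against a query, keep the top-m most salient
    # ones, and pool them with softmax attention weights.
    scores = feats @ query
    top = np.argsort(scores)[-top_m:]
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    local_token = (w[:, None] * feats[top]).sum(axis=0)

    return np.concatenate([global_token, local_token])
```

For a slide with 100 patches of 16-dim features, `dual_stream` returns a 32-dim vector (16 global + 16 local), which a downstream language model could consume as slide-level visual tokens.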