🤖 AI Summary
Histopathology reports generated by existing models often lack clinical concept alignment and diagnostic consistency, a problem driven by the ultra-high resolution, multi-scale heterogeneity, and stringent interpretability requirements of whole-slide images (WSIs). To address this, the authors propose SCOUT, a context-aware multimodal Transformer framework that combines depth-aware contextual modulation with adaptive multimodal fusion. Within a unified learning paradigm, the model dynamically fuses local tissue patterns, global WSI context, and expert-defined semantic diagnostic concepts, progressively refining visual representations during encoding to generate clinically coherent reports. Evaluated with CONCH1.5 features on the TCGA-BRCA, MICCAI REG, and HistAI datasets against WSI-Caption, HistGen, and BiGen, SCOUT achieves the best BLEU-1 to BLEU-4 and METEOR scores on all three datasets and the best ROUGE-L on TCGA-BRCA and MICCAI REG.
📝 Abstract
Whole-slide images (WSIs) present a fundamental challenge for computational pathology due to their extreme resolution, multi-scale heterogeneity, and the requirement for clinically reliable interpretation. Although recent pathology foundation models have enabled fluent report generation, they often lack clinical grounding, failing to accurately represent key diagnostic concepts and relationships observed by pathologists. This limitation arises from the difficulty of integrating heterogeneous visual evidence spanning fine-grained cellular patterns, slide-level tissue architecture, and high-level diagnostic concepts, while maintaining interpretability and clinical coherence. Here we present SCOUT (Semantic Context-aware mOdality fUsion Transformer), a context-aware, concept-grounded multimodal framework for pathology report generation that progressively conditions image representations on global slide information and explicit diagnostic concepts. The method integrates local histological patterns, whole-slide context, and expert-curated semantic descriptors within a unified learning paradigm, allowing visual features to be dynamically refined throughout the encoding process. By combining depth-aware contextual modulation with adaptive multimodal fusion during text generation, the framework produces clinically coherent reports while preserving complementarity across representational scales. Using CONCH1.5 features, we evaluate SCOUT against WSI-Caption, HistGen, and BiGen on TCGA-BRCA, MICCAI REG, and HistAI. SCOUT achieves the best BLEU-1 to BLEU-4 and METEOR scores on all datasets, plus the best ROUGE-L on TCGA-BRCA and MICCAI REG. On TCGA-BRCA, it reaches 0.436/0.303/0.202/0.156 BLEU-1/2/3/4 and 0.204 METEOR; on REG 2025, it reaches 0.865/0.834/0.805/0.780 BLEU-1/2/3/4 and 0.568 METEOR. These results support progressive contextual conditioning for grounded pathology report generation.
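The abstract does not specify how depth-aware contextual modulation or adaptive multimodal fusion are implemented. The NumPy sketch below illustrates one plausible reading and is not the authors' code: patch features receive a scale-and-shift modulation (FiLM-style, an assumption) driven by slide context and concept embeddings, with strength growing with layer depth, and three pooled streams are then combined with softmax gate weights. All function names, dimensions, and weight matrices are hypothetical, and random arrays stand in for CONCH1.5 features.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def depth_aware_modulation(patches, context, layer_idx, num_layers, W_scale, W_shift):
    # ASSUMPTION: modulation strength grows linearly with layer depth.
    depth = (layer_idx + 1) / num_layers      # value in (0, 1]
    scale = np.tanh(context @ W_scale)        # (d,), bounded in (-1, 1)
    shift = context @ W_shift                 # (d,)
    # Broadcasts the (d,) scale and shift over all patches.
    return patches * (1.0 + depth * scale) + depth * shift

def adaptive_fusion(local, global_ctx, concepts, W_gate):
    # ASSUMPTION: one scalar gate per stream, normalized with softmax.
    streams = np.stack([local, global_ctx, concepts])  # (3, d)
    weights = softmax(streams @ W_gate)                # (3,)
    return weights @ streams, weights                  # fused (d,), gate weights

rng = np.random.default_rng(0)
d, n_patches, num_layers = 8, 16, 4

patches = rng.normal(size=(n_patches, d))   # stand-in for patch features
global_ctx = patches.mean(axis=0)           # crude slide-level summary
concepts = rng.normal(size=d)               # stand-in for a concept embedding
context = np.concatenate([global_ctx, concepts])  # (2d,)

W_scale = rng.normal(scale=0.1, size=(2 * d, d))
W_shift = rng.normal(scale=0.1, size=(2 * d, d))
W_gate = rng.normal(scale=0.1, size=d)

# Refine patch features layer by layer, with deeper layers modulated more strongly.
x = patches
for layer in range(num_layers):
    x = depth_aware_modulation(x, context, layer, num_layers, W_scale, W_shift)

fused, weights = adaptive_fusion(x.mean(axis=0), global_ctx, concepts, W_gate)
print(fused.shape, weights)
```

In this sketch the gate weights are positive and sum to one, so the fused vector is a convex combination of the local, global, and concept streams. A trained model would learn the projection matrices and would most likely apply the fusion inside the decoder's attention rather than on pooled vectors.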