🤖 AI Summary
This work addresses the limitations of existing multimodal large models in computational pathology, which lack structured knowledge integration and interpretable memory mechanisms, thereby struggling to consistently adhere to diagnostic standards. Inspired by the hierarchical memory processes of human pathologists, we propose a memory-centric multimodal framework that stores structured pathological knowledge in long-term memory and introduces a cognition-aligned memory transformer to enable context-aware, dynamic transfer into working memory. The architecture integrates multimodal memory activation with knowledge anchoring, facilitating fine-grained and interpretable reasoning. Evaluated on WSI-Bench, our model achieves state-of-the-art performance, improving WSI-Precision and WSI-Relevance by 12.8% and 10.1%, respectively, in report generation, and by 9.7% and 8.9% in open-ended diagnostic tasks.
📝 Abstract
Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy, grading criteria, and clinical evidence. In practice, diagnostic reasoning requires linking morphological evidence with formal diagnostic and grading criteria. Although multimodal large language models (MLLMs) demonstrate strong vision-language reasoning capabilities, they lack explicit mechanisms for structured knowledge integration and interpretable memory control. As a result, existing models struggle to consistently incorporate pathology-specific diagnostic standards during reasoning. Inspired by the hierarchical memory process of human pathologists, we propose PathMem, a memory-centric multimodal framework for pathology MLLMs. PathMem organizes structured pathology knowledge as a long-term memory (LTM) and introduces a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding, yielding refined, context-specific memory for downstream reasoning. PathMem achieves state-of-the-art performance across benchmarks, improving WSI-Bench report generation by 12.8% in WSI-Precision and 10.1% in WSI-Relevance, and open-ended diagnosis by 9.7% and 8.9%, over prior WSI-based models.
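The LTM-to-WM transition described above can be pictured as attention-based retrieval over a knowledge bank. The following is a minimal sketch, not the paper's actual architecture: the function name `activate_memory`, the top-k sparsification, and the single fused query vector are all illustrative assumptions standing in for the multimodal memory activation step.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def activate_memory(query, ltm_keys, ltm_values, top_k=4):
    """Hypothetical sketch: retrieve a working-memory summary from a
    long-term memory bank via scaled dot-product attention.

    query:      (d,)   fused multimodal feature (e.g., image + text)
    ltm_keys:   (n, d) key embeddings of n structured knowledge entries
    ltm_values: (n, d) value embeddings of those entries
    Returns (working_memory, attention weights over the top_k entries).
    """
    scores = ltm_keys @ query / np.sqrt(query.shape[0])  # relevance scores
    top = np.argsort(scores)[-top_k:]                    # sparse activation
    weights = softmax(scores[top])                       # normalize over top-k
    working_memory = weights @ ltm_values[top]           # (d,) WM summary
    return working_memory, weights

rng = np.random.default_rng(0)
d, n = 16, 32
query = rng.standard_normal(d)
wm, w = activate_memory(query, rng.standard_normal((n, d)),
                        rng.standard_normal((n, d)))
print(wm.shape, round(float(w.sum()), 6))
```

In a full model, the retrieved working-memory vector would condition the decoder, anchoring generated findings to the activated knowledge entries rather than to parametric memory alone.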