Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning

📅 2026-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes HistoSelect, a coarse-to-fine retrieval framework that mimics pathologists’ visual search and attention patterns during whole-slide image interpretation—a strategy previously unexplored in computational pathology. Existing visual question answering models for gigapixel histopathology slides often fail to efficiently focus on diagnostically relevant regions due to uniform sampling or global attention mechanisms, leading to missed critical visual evidence. HistoSelect addresses this by first localizing question-relevant tissue regions through a query-guided, tissue-aware coarse search, then selecting high-information patches from these regions for fine-grained reasoning. Evaluated on 356,000 question-answer pairs, HistoSelect reduces visual token usage by 70% on average while achieving state-of-the-art accuracy across three pathology VQA tasks, with generated answers aligning closely with the regions pathologists attend to.

📝 Abstract
Computational pathology has advanced rapidly in recent years, driven by domain-specific image encoders and growing interest in using vision-language models to answer natural-language questions about diseases. Yet the core problem behind pathology question answering remains unsolved: a gigapixel slide contains far more information than any single question requires. Pathologists naturally navigate tissue and morphological complexity by scanning broadly and zooming in selectively according to the clinical question. Current models, in contrast, rely on uniform patch sampling or broad attention maps, often attending equally to irrelevant regions while overlooking key visual evidence. In this work, we bring models closer to how humans actually examine slides. We propose HistoSelect, a question-guided, tissue-aware, coarse-to-fine retrieval framework with two key components: a group sampler that identifies question-relevant tissue regions, followed by a patch selector that retrieves the most informative patches within those regions. By selecting only the most informative patches, our method is significantly more efficient, reducing visual token usage by 70% on average while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question-answer pairs, our approach outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. Our results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs.
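To make the two-stage pipeline concrete, here is a minimal sketch of question-guided coarse-to-fine selection. This is not the paper's implementation: the function name, embedding dimensions, similarity scoring (cosine via normalized dot products), and random toy data are all illustrative assumptions; the paper's group sampler and patch selector are presumably learned modules.

```python
import numpy as np

def coarse_to_fine_select(q_emb, region_embs, patch_embs, patch_region_ids,
                          top_regions=2, top_patches=4):
    """Illustrative two-stage selection: pick question-relevant tissue
    regions, then the most similar patches inside them.
    All embeddings are assumed L2-normalized."""
    # Coarse stage (cf. the group sampler): rank regions by cosine
    # similarity to the question embedding and keep the best few.
    region_scores = region_embs @ q_emb
    keep_regions = np.argsort(region_scores)[::-1][:top_regions]

    # Fine stage (cf. the patch selector): among patches belonging to the
    # kept regions, keep only the highest-scoring ones.
    candidates = np.where(np.isin(patch_region_ids, keep_regions))[0]
    patch_scores = patch_embs[candidates] @ q_emb
    order = np.argsort(patch_scores)[::-1][:top_patches]
    return candidates[order]  # indices of patches passed on for reasoning

# Toy data standing in for a slide: 8 coarse regions, 200 candidate patches.
rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
q = unit(rng.normal(size=64))                # question embedding
regions = unit(rng.normal(size=(8, 64)))     # region-level embeddings
patches = unit(rng.normal(size=(200, 64)))   # patch-level embeddings
region_of = rng.integers(0, 8, size=200)     # region id of each patch

selected = coarse_to_fine_select(q, regions, patches, region_of)
print(len(selected))  # 4 patches survive out of 200
```

The token savings claimed in the abstract come from exactly this funnel shape: only the final handful of patches is ever encoded into visual tokens for the vision-language model, rather than the full uniform grid.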
Problem

Research questions and friction points this paper is trying to address.

computational pathology
whole slide image
visual question answering
tissue-aware reasoning
gigapixel image
Innovation

Methods, ideas, or system contributions that make the work stand out.

question-guided retrieval
tissue-aware sampling
coarse-to-fine reasoning
whole slide image
vision-language model