🤖 AI Summary
Pathology vision-language models (VLMs) are prone to hallucinations, i.e., outputs inconsistent with the visual evidence, owing to the ultra-high resolution of histopathological images, complex tissue architectures, and fine-grained semantic distinctions; such errors severely undermine clinical trustworthiness. To address this, we propose a multimodal, agent-based retrieval-augmented generation (RAG) framework: (1) a page-level, jointly embedded text–image library built from authoritative pathology textbooks enables precise cross-modal retrieval; (2) an agent-based reasoning mechanism supports task decomposition, multi-step vision–text collaborative search, and iterative interaction. Unlike conventional text-only RAG approaches, our method deeply integrates both visual and semantic cues from domain-specific educational resources, substantially mitigating hallucination. On challenging tasks, including multiple-choice diagnosis and visual question answering, the framework achieves significantly higher accuracy and clinical utility than state-of-the-art baselines.
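The agent-based reasoning mechanism above (task decomposition, multi-turn search, iterative interaction) can be sketched as a simple control loop. This is a minimal illustration only, not the paper's implementation: `decompose`, `search`, and the toy knowledge base are hypothetical placeholders standing in for the framework's actual components.

```python
def decompose(question):
    # Placeholder task decomposition: split a compound query into sub-queries.
    # The real framework would use an LLM-driven planner here.
    return [q.strip() for q in question.split(";") if q.strip()]

def search(kb, query):
    # Placeholder retrieval: return pages whose text mentions the sub-query.
    # The real framework performs joint text-image search over page embeddings.
    return [page for page in kb if query.lower() in page.lower()]

def agentic_rag(question, kb, max_turns=2):
    """Multi-turn retrieval loop: decompose, search per sub-task, collect evidence."""
    evidence = []
    for task in decompose(question):
        for _ in range(max_turns):
            hits = search(kb, task)
            if hits:
                evidence.extend(hits)
                break  # enough evidence for this sub-task; stop iterating
    return evidence

# Toy "textbook" of two pages
kb = ["Page 12: ductal carcinoma morphology", "Page 40: lymphocyte infiltration"]
print(agentic_rag("ductal carcinoma; lymphocyte", kb))
```

In the full system, a final generation step would synthesize an answer from the retrieved pages rather than returning them directly.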
📝 Abstract
Although Vision-Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing retrieval-augmented generation (RAG) approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models on complex pathology tasks such as multiple-choice diagnosis and visual question answering. Our project is available at the Patho-AgenticRAG repository: https://github.com/Wenchuan-Zhang/Patho-AgenticRAG.
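The page-level joint text-image retrieval described above reduces, at query time, to nearest-neighbor search over precomputed page embeddings. The sketch below illustrates this with cosine similarity over random vectors; the embedding dimension, the `retrieve_pages` helper, and the toy data are assumptions for illustration, not the paper's actual encoder or index.

```python
import numpy as np

def normalize(v):
    # L2-normalize along the last axis so dot products equal cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def retrieve_pages(query_emb, page_embs, k=3):
    """Return indices of the top-k textbook pages by cosine similarity.

    query_emb: (d,) joint embedding of the text+image query
    page_embs: (n, d) precomputed page-level embeddings of the library
    """
    scores = normalize(page_embs) @ normalize(query_emb)
    top = np.argsort(-scores)[:k]
    return top, scores

# Toy example: 5 "pages" embedded in a 4-dim space; the query is a slightly
# perturbed copy of page 2, so page 2 should rank first.
rng = np.random.default_rng(0)
pages = rng.normal(size=(5, 4))
query = pages[2] + 0.05 * rng.normal(size=4)
top, scores = retrieve_pages(query, pages, k=2)
print(top)
```

A production system would replace the brute-force dot product with an approximate nearest-neighbor index once the textbook library grows large.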