Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pathology vision–language models (VLMs) suffer from hallucinations inconsistent with visual evidence, caused by the ultra-high resolution of histopathological images, complex tissue architectures, and fine-grained semantic distinctions, which severely undermines clinical trustworthiness. To address this, we propose a multimodal agentic retrieval-augmented generation (RAG) framework: (1) a page-level, jointly embedded text–image library built from authoritative pathology textbooks enables precise cross-modal retrieval; (2) an agent-based reasoning mechanism supports task decomposition, multi-step vision–text collaborative search, and iterative interaction. Unlike conventional text-only RAG approaches, our method deeply integrates both visual and semantic cues from domain-specific educational resources, substantially mitigating hallucination. Evaluated on challenging tasks, including multiple-choice diagnosis and visual question answering, our framework achieves significantly higher accuracy and clinical utility than state-of-the-art baselines.

📝 Abstract
Although Vision Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering. Our project is available at the Patho-AgenticRAG repository: https://github.com/Wenchuan-Zhang/Patho-AgenticRAG.
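The joint text-image page retrieval described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy hash-based embedder stands in for a CLIP-style model that would map queries and whole textbook pages (text plus figures) into one shared vector space, and all names (`toy_embed`, `PageIndex`) are hypothetical.

```python
# Sketch of page-level joint retrieval: each textbook page is embedded as a
# single unit, so its text and figures are always retrieved together.
import math

def toy_embed(content: str, dim: int = 8) -> list[float]:
    # Deterministic toy embedding -- NOT a real model, just enough to
    # demonstrate the retrieval mechanics over a shared vector space.
    vec = [0.0] * dim
    for i, ch in enumerate(content):
        vec[i % dim] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

class PageIndex:
    """Index of textbook pages, embedded at page level."""
    def __init__(self):
        self.pages: list[tuple[str, list[float]]] = []

    def add_page(self, page_id: str, page_content: str) -> None:
        self.pages.append((page_id, toy_embed(page_content)))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = toy_embed(query)
        ranked = sorted(self.pages, key=lambda p: cosine(q, p[1]), reverse=True)
        return [pid for pid, _ in ranked[:k]]

index = PageIndex()
index.add_page("p101", "ductal carcinoma in situ, comedo necrosis, figure 4.2")
index.add_page("p250", "renal glomerulus histology, figure 9.1")
top = index.search("ductal carcinoma in situ, comedo necrosis, figure 4.2")
print(top)
```

Because the query matches the first page's content, retrieval returns that page id; in the real system the returned page would carry both the matching text and its diagnostic figures to the VLM.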
Problem

Research questions and friction points this paper is trying to address.

Addresses hallucinations in pathology VLMs due to complex visuals
Enhances retrieval with multimodal text-image search for pathology
Improves diagnostic accuracy via reasoning and multi-turn interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal RAG with page-level textbook embeddings
Joint text-image search for pathology diagnostics
Reinforcement learning for task decomposition and reasoning
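The multi-turn agentic loop listed above can be sketched roughly as below. All function names (`decompose`, `retrieve`, `agentic_answer`) and the splitting heuristic are illustrative stand-ins: in the paper, decomposition and search decisions come from an RL-trained agent, and answers are generated by the VLM conditioned on retrieved pages.

```python
# Hedged sketch of task decomposition + multi-turn retrieval.
def decompose(question: str) -> list[str]:
    # Placeholder for the learned policy: split on semicolons.
    return [q.strip() for q in question.split(";") if q.strip()]

def retrieve(sub_query: str, knowledge: list[str]) -> list[str]:
    # Placeholder retriever: substring match against a toy knowledge base.
    return [page for page in knowledge if sub_query.lower() in page.lower()]

def agentic_answer(question: str, knowledge: list[str], max_turns: int = 3) -> list[str]:
    evidence = []
    for turn, sub_query in enumerate(decompose(question)):
        if turn >= max_turns:
            break
        evidence.extend(retrieve(sub_query, knowledge))
    # A real system would now generate the final answer with the VLM,
    # grounded in the retrieved pages; we just return the evidence.
    return evidence

kb = ["Page 12: mitotic figures in high-grade tumors",
      "Page 34: nuclear pleomorphism grading criteria"]
ev = agentic_answer("mitotic figures; nuclear pleomorphism", kb)
print(ev)
```

Each sub-query triggers its own retrieval turn, so evidence for a compound diagnostic question accumulates across turns instead of relying on a single shot.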
Wenchuan Zhang
Sichuan University
Clinical Pathology, Computational Pathology, Bioinformatics, Statistics
Jingru Guo
University of Toronto
Hengzhe Zhang
Victoria University of Wellington
Genetic Programming, AutoML
Penghao Zhang
Independent Researcher
Jie Chen
Institute of Clinical Pathology, West China Hospital, Sichuan University
Shuwan Zhang
Department of Pathology, Shengjing Hospital of China Medical University
Zhang Zhang
Department of Pathology, West China Hospital, Sichuan University
Yuhao Yi
Sichuan University
Optimization and Control, Networks, Machine Learning, Bioinformatics
Hong Bu
Department of Pathology, West China Hospital, Sichuan University