Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning

📅 2025-08-04

🤖 AI Summary
Pathology vision-language models (VLMs) are prone to hallucinations, i.e., outputs inconsistent with the visual evidence, because histopathology images are ultra-high resolution, tissue architecture is complex, and the relevant semantic distinctions are fine-grained. These hallucinations undermine clinical trust. To address this, we propose Patho-AgenticRAG, a multimodal agentic retrieval-augmented generation (RAG) framework with two components: (1) a page-level, jointly embedded image-text library built from authoritative pathology textbooks, which enables cross-modal retrieval; and (2) an agentic reasoning mechanism that supports task decomposition, multi-step joint vision-text search, and iterative interaction. Unlike conventional text-only RAG approaches, our method draws on both the visual and the textual content of textbook pages, so diagnostic image cues are not lost at retrieval time. On complex pathology tasks, including multiple-choice diagnosis and visual question answering, our framework significantly outperforms existing multimodal models.

📝 Abstract
Although Vision Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering. Our project is available at the Patho-AgenticRAG repository: https://github.com/Wenchuan-Zhang/Patho-AgenticRAG.
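
The joint text-image search over page-level embeddings described in the abstract can be illustrated with a toy example. This is only a minimal sketch under simplifying assumptions, not the authors' implementation: each textbook page is assumed to have a single embedding vector, the text and image parts of a query are fused by a weighted average, and pages are ranked by cosine similarity. The names `Page`, `fuse_query`, `search`, the weight `alpha`, and the three-dimensional vectors are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """A textbook page with one embedding covering its text and figures (assumption)."""
    book: str
    number: int
    embedding: List[float]


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def fuse_query(text_vec: List[float], image_vec: List[float], alpha: float = 0.5) -> List[float]:
    """Hypothetical fusion: weighted average of the unit-length text and image vectors."""
    t, i = normalize(text_vec), normalize(image_vec)
    return [alpha * x + (1 - alpha) * y for x, y in zip(t, i)]


def search(pages: List[Page], query_vec: List[float], top_k: int = 2) -> List[Page]:
    """Return the top_k pages ranked by cosine similarity to the query."""
    ranked = sorted(pages, key=lambda p: cosine(p.embedding, query_vec), reverse=True)
    return ranked[:top_k]


# Toy index; real page embeddings would come from a multimodal encoder.
PAGES = [
    Page("Toy Atlas", 12, [0.9, 0.1, 0.0]),
    Page("Toy Atlas", 47, [0.1, 0.9, 0.2]),
    Page("Toy Atlas", 88, [0.0, 0.2, 0.9]),
]

query = fuse_query(text_vec=[0.0, 1.0, 0.0], image_vec=[0.0, 0.8, 0.6])
top = search(PAGES, query, top_k=2)
print([(p.book, p.number) for p in top])
```

Because whole pages are returned, the figures on a matching page come back together with its text, which is the property the abstract contrasts with text-only retrieval.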
Problem

Research questions and friction points this paper is trying to address.

Addresses hallucinations in pathology VLMs due to complex visuals
Enhances retrieval with multimodal text-image search for pathology
Improves diagnostic accuracy via reasoning and multi-turn interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal RAG with page-level textbook embeddings
Joint text-image search for pathology diagnostics
Reinforcement learning for task decomposition and reasoning
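
The task decomposition and multi-turn search listed above can be sketched as a small control loop. This is a sketch under stated assumptions and not the paper's method: the paper trains this behavior with reinforcement learning, whereas the hand-written `decompose` rule and the "drop the first word" retry below are hypothetical stand-ins for a learned policy. The `TOY_INDEX` keyword lookup is likewise a placeholder for a real retriever.

```python
from typing import Callable, Dict, List


def decompose(question: str) -> List[str]:
    """Hypothetical stand-in for a learned policy: split the question on ' and '."""
    parts = [p.strip(" ?.").lower() for p in question.split(" and ")]
    return [p for p in parts if p]


def agentic_search(
    question: str,
    retrieve: Callable[[str], List[str]],
    max_turns: int = 4,
) -> Dict[str, List[str]]:
    """Run sub-queries one turn at a time, broadening a query when it finds nothing."""
    pending = decompose(question)
    evidence: Dict[str, List[str]] = {}
    turns = 0
    while pending and turns < max_turns:
        sub_query = pending.pop(0)
        turns += 1
        hits = retrieve(sub_query)
        if hits:
            evidence[sub_query] = hits
        elif " " in sub_query:
            # Nothing found: retry next turn with a broader query (first word dropped).
            pending.insert(0, sub_query.split(" ", 1)[1])
    return evidence


# Toy retriever: exact keyword lookup standing in for multimodal page retrieval.
TOY_INDEX = {
    "mitotic figures": ["Toy Atlas p.47"],
    "necrosis": ["Toy Atlas p.88"],
}


def toy_retrieve(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])


result = agentic_search("Mitotic figures and extensive necrosis?", toy_retrieve)
print(result)
```

The turn budget `max_turns` bounds how many retrieval calls a single question may trigger, which keeps the loop from retrying indefinitely when no page matches.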