🤖 AI Summary
Medical large language models (LLMs) suffer from knowledge gaps and hallucination, while existing retrieval-augmented generation (RAG) and tool-augmented approaches are limited by suboptimal retrieval utilization and non-traceable reasoning, undermining diagnostic reliability. To address this, we propose Deep-DxSearch—a novel end-to-end agent-based RAG system that formalizes the LLM as an intelligent agent and the medical knowledge base as its environment. For the first time, we employ multi-objective reinforcement learning to jointly optimize retrieval policies, reasoning structure, and diagnostic accuracy, enabling guided and fully traceable clinical decision-making. Our system is trained on a large-scale corpus of real-world electronic health records and authoritative medical sources, with rewards incorporating structured output fidelity, retrieval quality, reasoning path consistency, and diagnostic accuracy. Experiments demonstrate that Deep-DxSearch significantly outperforms prompt-engineering and training-free RAG baselines across multiple benchmarks, achieving superior performance over strong models—including GPT-4o and DeepSeek-R1—in both common and rare disease diagnosis, while ensuring high accuracy and clinical interpretability.
📝 Abstract
Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor traceability between feedback and reasoning. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steerable, traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL.
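As an illustration of how rewards on format, retrieval, reasoning structure, and diagnostic accuracy might be combined into a single scalar for RL, here is a minimal sketch. It is not the authors' implementation: the weighted-sum form, the weight values, the `[0, 1]` score ranges, and the names `RewardWeights` and `composite_reward` are all assumptions made for this example, since the abstract does not specify how the components are scored or aggregated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract does not report any values.
    format: float = 0.1
    retrieval: float = 0.2
    reasoning: float = 0.2
    diagnosis: float = 0.5


def _clamp01(x: float) -> float:
    """Clamp a score into the [0, 1] range assumed by this sketch."""
    return max(0.0, min(1.0, x))


def composite_reward(
    format_ok: bool,
    retrieval_score: float,
    reasoning_score: float,
    diagnosis_correct: bool,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Combine four reward components into one scalar by weighted sum.

    Assumed inputs (all hypothetical):
      format_ok         -- whether the rollout follows the expected output structure
      retrieval_score   -- quality of retrieved evidence, in [0, 1]
      reasoning_score   -- quality of the reasoning structure, in [0, 1]
      diagnosis_correct -- whether the final diagnosis matches the reference
    """
    return (
        weights.format * (1.0 if format_ok else 0.0)
        + weights.retrieval * _clamp01(retrieval_score)
        + weights.reasoning * _clamp01(reasoning_score)
        + weights.diagnosis * (1.0 if diagnosis_correct else 0.0)
    )


if __name__ == "__main__":
    # A well-formatted rollout with decent retrieval but a wrong diagnosis.
    print(composite_reward(True, 0.8, 0.6, False))
```

With weights like these, a rollout that is well formatted and retrieves good evidence still earns only partial credit when the diagnosis is wrong. That is one way a scalar reward can balance the four objectives, but the actual balance used in training is not stated here.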
Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch's diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See https://github.com/MAGIC-AI4Med/Deep-DxSearch.