🤖 AI Summary
Existing retrieval-augmented generation (RAG) pipelines for medical question answering struggle with the hallucinations and outdated knowledge of large language models (LLMs), and are further hindered by irrelevant context, poorly targeted queries, and single-source retrieval bias. To address these limitations, the authors propose RAG² (RAtionale-Guided RAG), which introduces three components: (1) a small perplexity-supervised filter that keeps informative document snippets and discards distractors; (2) LLM-generated rationales used as retrieval queries to improve the utility of retrieved snippets; and (3) a balanced retrieval mechanism that draws evidence evenly from four biomedical corpora to mitigate retriever bias. Across three medical question-answering benchmarks, RAG² improves state-of-the-art LLMs of varying sizes by up to 6.1% and outperforms the previous best medical RAG model by up to 5.6%.
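As a rough intuition for the perplexity-supervised filter, a snippet can be labeled "helpful" when conditioning on it lowers the perplexity of the gold answer. The following is a toy sketch only: the add-one-smoothed unigram model standing in for the LLM, the vocabulary size, and the labeling rule are all illustrative assumptions, not the paper's actual implementation.

```python
import math
from collections import Counter

def perplexity(text, counts, vocab_size):
    """Per-token perplexity of `text` under a toy add-one-smoothed
    unigram model defined by the token `counts` (stands in for an LLM)."""
    tokens = text.lower().split()
    total = sum(counts.values())
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return math.exp(-log_prob / max(len(tokens), 1))

def label_snippet(question, snippet, answer, vocab_size=1000):
    """Hypothetical supervision signal: label the snippet helpful if
    adding it to the context lowers the gold answer's perplexity."""
    ctx = Counter(question.lower().split())
    ppl_without = perplexity(answer, ctx, vocab_size)
    ctx_with = ctx + Counter(snippet.lower().split())
    ppl_with = perplexity(answer, ctx_with, vocab_size)
    return ppl_with < ppl_without
```

A snippet that shares tokens with the gold answer raises those tokens' probability and earns a positive label, while an off-topic distractor does not; a real filter would use the LLM's own perplexity and then be distilled into a small model.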
📝 Abstract
Large language models (LLMs) hold significant potential for applications in biomedicine, but they struggle with hallucinations and outdated knowledge. While retrieval-augmented generation (RAG) is generally employed to address these issues, it also has its own set of challenges: (1) LLMs are vulnerable to irrelevant or incorrect context, (2) medical queries are often not well-targeted for helpful information, and (3) retrievers are prone to bias toward the specific source corpus they were trained on. In this study, we present RAG$^2$ (RAtionale-Guided RAG), a new framework for enhancing the reliability of RAG in biomedical contexts. RAG$^2$ incorporates three key innovations: (1) a small filtering model trained on perplexity-based labels of rationales, which selectively augments informative snippets of documents while filtering out distractors; (2) LLM-generated rationales used as queries to improve the utility of retrieved snippets; and (3) a structure designed to retrieve snippets evenly from a comprehensive set of four biomedical corpora, effectively mitigating retriever bias. Our experiments demonstrate that RAG$^2$ improves state-of-the-art LLMs of varying sizes by up to 6.1%, and that it outperforms the previous best medical RAG model by up to 5.6% across three medical question-answering benchmarks. Our code is available at https://github.com/dmis-lab/RAG2.
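The third innovation, retrieving snippets evenly across corpora, can be pictured as a round-robin merge of per-corpus ranked lists, so that no single corpus dominates the top-k evidence handed to the LLM. A minimal sketch under assumed inputs (the merge policy and deduplication here are illustrative, not the paper's exact retrieval structure):

```python
from itertools import chain, zip_longest

def balanced_merge(ranked_lists, k):
    """Interleave per-corpus ranked snippet lists round-robin,
    deduplicating, so each corpus contributes evenly to the top-k."""
    merged, seen = [], set()
    for snippet in chain.from_iterable(zip_longest(*ranked_lists)):
        if snippet is not None and snippet not in seen:
            seen.add(snippet)
            merged.append(snippet)
        if len(merged) == k:
            break
    return merged
```

For example, merging the top results of four separate retrievers (one per biomedical corpus) yields an evidence set whose first k slots alternate across sources instead of being filled by whichever corpus the retriever happens to favor.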