🤖 AI Summary
This work addresses the challenge of simultaneously achieving answer accuracy, dynamic query optimization, and generation grounded in up-to-date, verifiable evidence in biomedical question answering. The authors propose a three-stage, retrieval-first agent framework that first refines queries containing MeSH terms through a self-evaluation mechanism, then performs iterative, reflection-driven literature retrieval in batches until sufficient evidence is gathered, and finally produces answers with explicit citations. This approach integrates query self-assessment, adaptive batched retrieval, and evidence-driven generation, improving reliability while controlling computational cost. Implemented with GPT-4o and enhanced by MeSH analysis, metadata pre-retrieval, and citation-aware generation, the system achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and demonstrates state-of-the-art performance on MMLU clinical tasks and across multiple LLM-judged evaluation dimensions, including reasoning rigor, evidential support, clinical relevance, and trustworthiness.
📝 Abstract
Trustworthy biomedical question answering (QA) systems must not only provide accurate answers but also justify them with current, verifiable evidence. Retrieval-augmented approaches partially address this gap but lack mechanisms to iteratively refine poor queries, whereas self-reflection methods intervene only after retrieval has completed. In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata) retrieval; reflective retrieval processes articles in batches until sufficient evidence is gathered; and evidence-grounded response generation produces answers with explicit citations. PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and shows consistent gains on MMLU Clinical Knowledge. Moreover, LLM-as-judge evaluations prefer our responses across four dimensions: reasoning soundness, evidence grounding, clinical relevance, and trustworthiness. By orchestrating retrieval-first reasoning over authoritative sources, our approach provides practical assistance to clinicians and biomedical researchers while controlling compute and token costs.
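The three-stage loop described above can be sketched in outline. This is a minimal, hypothetical illustration of the control flow only: every helper name (`critique_query`, `fetch_batch`, `evidence_sufficient`, `generate_answer`) is an illustrative placeholder standing in for the paper's LLM and PubMed calls, not the authors' actual API.

```python
def critique_query(query: str) -> str:
    """Stage 1 (stand-in): self-critique MeSH terms for coverage,
    alignment, and redundancy, refining the PubMed query.
    A real system would call an LLM over metadata pre-retrieval results."""
    return query if "mesh:" in query else query + " mesh:refined"

def fetch_batch(query: str, offset: int, size: int = 2) -> list[str]:
    """Stand-in for retrieving one batch of PubMed articles."""
    corpus = [f"article-{i} for {query}" for i in range(5)]
    return corpus[offset:offset + size]

def evidence_sufficient(evidence: list[str], needed: int = 3) -> bool:
    """Stage 2 reflection step: decide whether enough evidence is gathered.
    A real system would have an LLM judge sufficiency, not count items."""
    return len(evidence) >= needed

def generate_answer(question: str, evidence: list[str]) -> str:
    """Stage 3 (stand-in): produce an answer with explicit citations."""
    citations = " ".join(f"[{i + 1}]" for i in range(len(evidence)))
    return f"Answer to '{question}' {citations}"

def pubmed_reasoner(question: str, max_batches: int = 4) -> str:
    """Retrieval-first pipeline: refine, retrieve in batches until
    sufficient, then generate a cited answer."""
    query = critique_query(question)
    evidence: list[str] = []
    for batch_idx in range(max_batches):  # reflective, batched retrieval
        evidence += fetch_batch(query, offset=batch_idx * 2)
        if evidence_sufficient(evidence):
            break  # stop early to control compute and token cost
    return generate_answer(question, evidence)
```

Batching with an early-exit sufficiency check is what lets this design cap retrieval cost: the loop stops as soon as the reflection step is satisfied rather than exhausting all candidate articles.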