🤖 AI Summary
Current large language models (LLMs) face critical limitations in clinical decision support, particularly in reliably performing multi-hop retrieval and rigorous reasoning over dynamic, heterogeneous medical knowledge sources—including clinical trials, practice guidelines, regulatory documents, and cost data. Existing benchmarks rely heavily on synthetic prompts or single-hop factual queries, failing to capture the complexity and accuracy demands of real-world clinical decision chains.
Method: We introduce MedBrowseComp, the first benchmark built around authentic clinical scenarios: over 1,000 human-annotated, multi-hop questions spanning complete clinical decision pathways, together with the first systematic evaluation of agent factuality and synthesis reliability over live, domain-specific knowledge sources.
Contribution/Results: Experiments show that even state-of-the-art medical agents achieve accuracy as low as ~10%, exposing fundamental deficiencies in rigorous clinical information integration. The benchmark provides a quantifiable evaluation framework and concrete directions for improving model capabilities and toolchain design.
📝 Abstract
Large language models (LLMs) are increasingly envisioned as decision-support tools in clinical practice, yet safe clinical reasoning demands integrating heterogeneous knowledge bases -- trials, primary studies, regulatory documents, and cost data -- under strict accuracy constraints. Existing evaluations often rely on synthetic prompts, reduce the task to single-hop factoid queries, or conflate reasoning with open-ended generation, leaving their real-world utility unclear. To close this gap, we present MedBrowseComp, the first benchmark that systematically tests an agent's ability to reliably retrieve and synthesize multi-hop medical facts from live, domain-specific knowledge bases. MedBrowseComp contains more than 1,000 human-curated questions that mirror clinical scenarios in which practitioners must reconcile fragmented or conflicting information to reach an up-to-date conclusion. Applying MedBrowseComp to frontier agentic systems reveals accuracy as low as ten percent, exposing a critical gap between current LLM capabilities and the rigor demanded in clinical settings. MedBrowseComp therefore offers a clear testbed for reliable medical information seeking and sets concrete goals for future model and toolchain upgrades. Our project page is available at: https://moreirap12.github.io/mbc-browse-app/