HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways

📅 2025-08-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing medical QA datasets inadequately evaluate large language models’ (LLMs) capabilities in multi-step clinical reasoning and decision-path modeling. To address this, we propose the first systematic benchmark construction methodology grounded in clinically validated decision paths: a semi-automated pipeline transforms decision-tree structures from authoritative medical knowledge sources into realistic patient cases, yielding a high-quality QA dataset comprising 4,063 cases across 17 medical specialties, each annotated with complete reasoning chains. The dataset supports both open-ended and multiple-choice questions, enabling interpretable, structured evaluation. Critically, it is the first to jointly assess multi-step reasoning and retrieval-augmented generation (RAG) performance. Our benchmark significantly enhances the reliability of LLM evaluation in high-stakes clinical settings and establishes a novel standard for medical education and AI-powered clinical decision support.

Technology Category

Application Category

📝 Abstract
HealthBranches is a novel benchmark dataset for medical Question-Answering (Q&A), specifically designed to evaluate complex reasoning in Large Language Models (LLMs). This dataset is generated through a semi-automated pipeline that transforms explicit decision pathways from medical source into realistic patient cases with associated questions and answers. Covering 4,063 case studies across 17 healthcare topics, each data point is based on clinically validated reasoning chains. HealthBranches supports both open-ended and multiple-choice question formats and uniquely includes the full reasoning path for each Q&A. Its structured design enables robust evaluation of LLMs' multi-step inference capabilities, including their performance in structured Retrieval-Augmented Generation (RAG) contexts. HealthBranches establishes a foundation for the development of more trustworthy, interpretable, and clinically reliable LLMs in high-stakes domains while also serving as a valuable resource for educational purposes.
Problem

Research questions and friction points this paper is trying to address.

Evaluating complex reasoning in medical LLMs via QA datasets
Generating clinically-grounded QA data from decision pathways
Assessing multi-step inference in structured RAG contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-automated pipeline for dataset generation
Clinically validated reasoning chains foundation
Structured design for multi-step inference evaluation
🔎 Similar Papers
No similar papers found.