🤖 AI Summary
Existing clinical multiple-choice question (MCQ) benchmarks suffer from insufficient difficulty, limiting their ability to rigorously evaluate the clinical decision-making reliability of large language models (LLMs).
Method: We propose a knowledge graph–guided distractor generation framework that integrates the UMLS and MeSH biomedical knowledge graphs with multi-hop semantic random walks to identify semantically related yet factually incorrect reasoning paths. A path-aware prompting strategy then directs LLaMA and Med-PaLM to generate clinically plausible yet highly misleading distractors.
Contribution/Results: This work is the first to synergistically combine structured knowledge graph reasoning with LLM prompting for controllable, high-fidelity distractor generation, achieving a fine-grained balance between clinical plausibility and diagnostic deceptiveness. Evaluated across six major medical QA benchmarks, our framework consistently degrades state-of-the-art model accuracy and significantly enhances the robustness of clinical diagnostic capability assessment, establishing a new standard for evaluating clinical LLM reliability.
📝 Abstract
Clinical tasks such as diagnosis and treatment require strong decision-making abilities, highlighting the importance of rigorous evaluation benchmarks for assessing the reliability of large language models (LLMs). In this work, we introduce a knowledge-guided data augmentation framework that increases the difficulty of clinical multiple-choice question (MCQ) datasets by generating distractors, i.e., incorrect choices that resemble the correct answer and may confuse existing LLMs. Using our KG-based pipeline, the generated choices are both clinically plausible and deliberately misleading. Our approach performs multi-step, semantically informed walks on a medical knowledge graph to identify distractor paths (associations that are medically relevant but factually incorrect), which then guide the LLM in crafting more deceptive distractors. We apply the resulting knowledge graph guided distractor generation (KGGDG) pipeline to six widely used medical QA benchmarks and show that it consistently reduces the accuracy of state-of-the-art LLMs. These findings establish KGGDG as a powerful tool for enabling more robust and diagnostic evaluations of medical LLMs.
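The multi-hop walk described above can be sketched as follows. This is a minimal illustration, not the paper's released code: the toy graph, node names, and relation labels below are hypothetical stand-ins for UMLS/MeSH concepts, and the walk simply follows random outgoing edges for a fixed number of hops, so that the endpoint is medically related to the correct answer but typically no longer correct.

```python
import random

# Toy medical knowledge graph: node -> list of (relation, neighbor) edges.
# All concept and relation names are illustrative, not actual UMLS/MeSH data.
KG = {
    "myocardial_infarction": [("has_symptom", "chest_pain"),
                              ("treated_by", "aspirin")],
    "chest_pain": [("symptom_of", "angina"),
                   ("symptom_of", "pulmonary_embolism")],
    "aspirin": [("treats", "ischemic_stroke")],
    "angina": [("treated_by", "nitroglycerin")],
    "pulmonary_embolism": [("treated_by", "anticoagulants")],
    "ischemic_stroke": [],
    "nitroglycerin": [],
    "anticoagulants": [],
}

def distractor_path(start, hops, rng):
    """Random multi-hop walk starting from the correct answer's concept.

    Returns a list of (relation, node) steps; the final node is a
    candidate distractor concept: related to the answer, but after
    two or more hops usually factually incorrect for the question.
    """
    path, node = [], start
    for _ in range(hops):
        edges = KG.get(node, [])
        if not edges:
            break  # dead end: stop early
        rel, node = rng.choice(edges)
        path.append((rel, node))
    return path

rng = random.Random(0)  # seeded for reproducibility
walks = [distractor_path("myocardial_infarction", hops=2, rng=rng)
         for _ in range(3)]
# Keep only full-length walks; their endpoints are distractor candidates
# that a path-aware prompt would hand to the LLM for surface realization.
candidates = sorted({w[-1][1] for w in walks if len(w) == 2})
print(candidates)
```

In the actual framework the walk would be constrained by semantic type (e.g. staying within diseases or treatments) rather than being uniformly random, and the resulting relation path, not just the endpoint, is embedded in the prompt so the LLM can phrase the distractor in the question's own clinical register.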