🤖 AI Summary
Existing clinical multiple-choice question (MCQ) benchmarks suffer from insufficient difficulty, limiting their ability to rigorously evaluate the clinical decision-making reliability of large language models (LLMs).
Method: We propose a knowledge graph–guided distractor generation framework that integrates the UMLS and MeSH biomedical knowledge graphs with multi-hop semantic random walks to identify semantically related yet factually incorrect reasoning paths. A path-aware prompting strategy then directs LLaMA and Med-PaLM to generate clinically plausible yet highly misleading distractors.
Contribution/Results: This work is the first to synergistically combine structured knowledge graph reasoning with LLM prompting for controllable, high-fidelity distractor generation, achieving a fine-grained balance between clinical plausibility and diagnostic deceptiveness. Evaluated across six major medical QA benchmarks, our framework consistently degrades state-of-the-art model accuracy and significantly enhances the robustness of clinical diagnostic capability assessment, establishing a new standard for evaluating clinical LLM reliability.
📝 Abstract
Clinical tasks such as diagnosis and treatment require strong decision-making abilities, highlighting the importance of rigorous evaluation benchmarks for assessing the reliability of large language models (LLMs). In this work, we introduce a knowledge-guided data augmentation framework that increases the difficulty of clinical multiple-choice question (MCQ) datasets by generating distractors, i.e., incorrect choices that resemble the correct answer and may confuse existing LLMs. Using our KG-based pipeline, the generated choices are both clinically plausible and deliberately misleading. Our approach performs multi-step, semantically informed walks on a medical knowledge graph to identify distractor paths (associations that are medically relevant but factually incorrect), which then guide the LLM in crafting more deceptive distractors. We apply the resulting knowledge graph guided distractor generation (KGGDG) pipeline to six widely used medical QA benchmarks and show that it consistently reduces the accuracy of state-of-the-art LLMs. These findings establish KGGDG as a powerful tool for enabling more robust and diagnostic evaluations of medical LLMs.
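The multi-hop walk described above can be sketched as follows. This is a minimal illustration, not the paper's released code: the toy graph, node names, and relation labels below are hypothetical stand-ins for UMLS/MeSH concepts, and the walk simply follows random outgoing edges for a fixed number of hops, so that the endpoint is medically related to the correct answer but typically no longer correct.

```python
import random

# Toy medical knowledge graph: node -> list of (relation, neighbor) edges.
# All concept and relation names are illustrative, not actual UMLS/MeSH data.
KG = {
    "myocardial_infarction": [("has_symptom", "chest_pain"),
                              ("treated_by", "aspirin")],
    "chest_pain": [("symptom_of", "angina"),
                   ("symptom_of", "pulmonary_embolism")],
    "aspirin": [("treats", "ischemic_stroke")],
    "angina": [("treated_by", "nitroglycerin")],
    "pulmonary_embolism": [("treated_by", "anticoagulants")],
    "ischemic_stroke": [],
    "nitroglycerin": [],
    "anticoagulants": [],
}

def distractor_path(start, hops, rng):
    """Random multi-hop walk starting from the correct answer's concept.

    Returns a list of (relation, node) steps; the final node is a
    candidate distractor concept: related to the answer, but after
    two or more hops usually factually incorrect for the question.
    """
    path, node = [], start
    for _ in range(hops):
        edges = KG.get(node, [])
        if not edges:
            break  # dead end: stop early
        rel, node = rng.choice(edges)
        path.append((rel, node))
    return path

rng = random.Random(0)  # seeded for reproducibility
walks = [distractor_path("myocardial_infarction", hops=2, rng=rng)
         for _ in range(3)]
# Keep only full-length walks; their endpoints are distractor candidates
# that a path-aware prompt would hand to the LLM for surface realization.
candidates = sorted({w[-1][1] for w in walks if len(w) == 2})
print(candidates)
```

In the actual framework the walk would be constrained by semantic type (e.g. staying within diseases or treatments) rather than being uniformly random, and the resulting relation path, not just the endpoint, is embedded in the prompt so the LLM can phrase the distractor in the question's own clinical register.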