Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine

📅 2024-06-04
📈 Citations: 6
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM evaluations of medical reasoning over-rely on multiple-choice questions (MCQs), making them vulnerable to pattern recognition and test-taking heuristics rather than genuine clinical reasoning. Method: We introduce Glianorex, a fictional medical benchmark built around an invented organ, with bilingual (English-French) material that enables zero-shot clinical reasoning assessment and isolates reasoning ability from real-world knowledge confounds. We use leading LLMs to generate textbooks and exams, complemented by ablation studies, interpretability techniques, and zero-shot evaluation across proprietary, open-source, and domain-specific models. Results: Top LLMs achieve a mean score of 64%, far exceeding the physician baseline of 27%, revealing heavy reliance on superficial cues and hallucinated reasoning; domain-adapted medical models show only marginal gains, and only on English tasks. This work introduces the fictional-benchmark paradigm for medicine, exposing systematic overestimation in MCQ-based evaluation and motivating more rigorous, disentangled standards for assessing LLM clinical reasoning.

📝 Abstract
Large Language Models (LLMs) such as ChatGPT demonstrate significant potential in the medical domain and are often evaluated using multiple-choice questions (MCQs) modeled on exams like the USMLE. However, such benchmarks may overestimate true clinical understanding by rewarding pattern recognition and test-taking heuristics. To investigate this, we created a fictional medical benchmark centered on an imaginary organ, the Glianorex, allowing us to separate memorized knowledge from reasoning ability. We generated textbooks and MCQs in English and French using leading LLMs, then evaluated proprietary, open-source, and domain-specific models in a zero-shot setting. Despite the fictional content, models achieved an average score of 64%, while physicians scored only 27%. Fine-tuned medical models outperformed base models in English but not in French. Ablation and interpretability analyses revealed that models frequently relied on shallow cues, test-taking strategies, and hallucinated reasoning to identify the correct choice. These results suggest that standard MCQ-based evaluations may not effectively measure clinical reasoning and highlight the need for more robust, clinically meaningful assessment methods for LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating clinical reasoning vs pattern recognition in LLMs
Assessing validity of MCQ-based medical benchmarks
Identifying shallow cues and test-taking strategies in model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created fictional medical benchmark with imaginary organ
Evaluated models using zero-shot setting in multiple languages
Analyzed models' reliance on shallow cues and strategies
Maxime Griot
Institute of NeuroScience, Université catholique de Louvain
Jean Vanderdonckt
Louvain Research Institute in Management and Organizations, Université catholique de Louvain
Demet Yuksel
Medical Information Department, Cliniques Universitaires Saint-Luc
C. Hemptinne
Institute of NeuroScience, Université catholique de Louvain