🤖 AI Summary
This study investigates the alignment between physician trainees and large language models (LLMs) in sentence-level relevance assessment for medical question answering, and its impact on task performance. To this end, we introduce MedPAIR, a benchmark comprising 1,300 medical QA pairs annotated with fine-grained sentence-level relevance judgments by 36 physician trainees. Our quantitative analysis reveals a consistent divergence between LLMs’ relevance assessments and human clinical judgment. Crucially, removing sentences deemed irrelevant by physicians significantly improves QA accuracy for both humans and LLMs. By integrating human annotation, relevance modeling, analysis of LLM reasoning, and attribution of downstream performance, we identify misalignment in information prioritization as a fundamental bottleneck limiting LLMs’ medical reasoning capability. All annotated data are publicly released to support reproducible research.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable performance on various medical question-answering (QA) benchmarks, including standardized medical exams. However, correct answers alone do not ensure correct logic, and models may reach accurate conclusions through flawed processes. In this study, we introduce the MedPAIR (Medical Dataset Comparing Physicians and AI Relevance Estimation and Question Answering) dataset to evaluate how physician trainees and LLMs prioritize relevant information when answering medical QA questions. We obtain annotations on 1,300 QA pairs from 36 physician trainees, labeling each sentence within the question components for relevance. We compare these relevance estimates to those of LLMs, and further evaluate the impact of these "relevant" subsets on downstream task performance for both physician trainees and LLMs. We find that LLMs frequently do not align with the content relevance estimates of physician trainees. After filtering out sentences labeled irrelevant by physician trainees, accuracy improves for both the trainees and the LLMs. All LLM and physician trainee-labeled data are available at: http://medpair.csail.mit.edu/.