MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the alignment between physician trainees and large language models (LLMs) in sentence-level relevance assessment for medical question answering, and its impact on task performance. To this end, we introduce MedPAIR—a novel benchmark comprising 1,300 medical QA pairs annotated with fine-grained sentence-level relevance judgments by 36 physician trainees. Our systematic quantitative analysis reveals, for the first time, a consistent divergence between LLMs’ relevance assessments and human clinical judgment. Crucially, removing sentences deemed irrelevant by physicians significantly improves QA accuracy for both humans and LLMs. Integrating human annotation, relevance modeling, LLM reasoning analysis, and downstream performance attribution, we identify misalignment in information prioritization as a fundamental bottleneck limiting LLMs’ medical reasoning capability. All annotated data are publicly released to support reproducible research.

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable performance on various medical question-answering (QA) benchmarks, including standardized medical exams. However, correct answers alone do not ensure correct logic, and models may reach accurate conclusions through flawed processes. In this study, we introduce the MedPAIR (Medical Dataset Comparing Physicians and AI Relevance Estimation and Question Answering) dataset to evaluate how physician trainees and LLMs prioritize relevant information when answering QA questions. We obtain annotations on 1,300 QA pairs from 36 physician trainees, labeling each sentence within the question components for relevance. We compare these relevance estimates to those for LLMs, and further evaluate the impact of these "relevant" subsets on downstream task performance for both physician trainees and LLMs. We find that LLMs are frequently not aligned with the content relevance estimates of physician trainees. After filtering out physician trainee-labeled irrelevant sentences, accuracy improves for both the trainees and the LLMs. All LLM and physician trainee-labeled data are available at: http://medpair.csail.mit.edu/.
Problem

Research questions and friction points this paper is trying to address.

Evaluates whether LLMs and physician trainees align in sentence-level relevance judgments for medical QA
Compares how AI models and physician trainees prioritize information in clinical vignettes
Assesses the impact of relevance-based sentence filtering on QA accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the MedPAIR dataset of 1,300 QA pairs with sentence-level relevance annotations
Directly compares physician trainee and LLM relevance estimates on the same questions
Shows that filtering out physician-labeled irrelevant sentences improves accuracy for both humans and LLMs
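The filtering step behind the last finding can be sketched as follows: keep only the sentences a physician trainee marked relevant, then pass the shortened vignette to the QA model. This is a minimal illustration; the data class and field names are assumptions for exposition, not the released MedPAIR schema.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedSentence:
    """One sentence of a question vignette with a trainee's relevance label.
    (Illustrative structure; see the MedPAIR release for the actual format.)"""
    text: str
    physician_relevant: bool

def filter_question(sentences: list[AnnotatedSentence]) -> str:
    """Drop sentences physicians judged irrelevant before QA."""
    return " ".join(s.text for s in sentences if s.physician_relevant)

# Toy vignette: two clinically relevant sentences and one distractor.
vignette = [
    AnnotatedSentence("A 64-year-old man presents with crushing chest pain.", True),
    AnnotatedSentence("He recently returned from a vacation abroad.", False),
    AnnotatedSentence("ECG shows ST-segment elevation in leads II, III, and aVF.", True),
]

filtered = filter_question(vignette)
```

In the paper's setup, accuracy on the filtered vignettes is compared against accuracy on the full originals, for both trainees and LLMs, to attribute performance differences to information prioritization.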