Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs)

📅 2025-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manual scoring of clinical communication skills in Objective Structured Clinical Examinations (OSCEs) is time-intensive and prone to subjective bias. Method: This study systematically evaluates the feasibility of large language models (LLMs), namely GPT-4o, Claude 3.5, Llama 3.1, and Gemini 1.5 Pro, for automated OSCE scoring using the Master Interview Rating Scale (MIRS). We benchmark zero-shot, chain-of-thought, few-shot, and multi-step prompting across 10 standardized cases and 174 expert-validated ratings. A novel three-tier accuracy framework (exact, off-by-one, and thresholded) is introduced to assess performance across all MIRS items. Results: LLMs achieve low exact accuracy (0.27 to 0.44) but moderate to high off-by-one (0.67 to 0.87) and thresholded (0.75 to 0.88) accuracy; GPT-4o at zero temperature attains high intra-rater consistency (Cronbach's α = 0.98). Performance remains stable across OSCE stations and communication domains. This work establishes the first reproducible, high-consistency AI-assisted baseline for clinical communication assessment, providing a methodological foundation for automating evaluation in medical education.
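The four prompting conditions named above can be sketched as templates. This is a hypothetical illustration, not the study's actual prompts: the wording, the example MIRS item, and the two-stage decomposition used for multi-step prompting are all assumptions.

```python
# Hypothetical templates for the four benchmarked prompting conditions.
# The MIRS item text and all prompt wording are illustrative only.

MIRS_ITEM = "MIRS item: Non-leading questions (rate 1-5)"

ZERO_SHOT = (
    "You are an OSCE rater. Score the transcript on the item below.\n"
    "{item}\nTranscript:\n{transcript}\nScore:"
)

CHAIN_OF_THOUGHT = (
    "You are an OSCE rater. Reason step by step about the student's "
    "behavior relevant to the item, then give a 1-5 score.\n"
    "{item}\nTranscript:\n{transcript}\nReasoning:"
)

FEW_SHOT = (
    "You are an OSCE rater. Here are expert-scored examples:\n{examples}\n"
    "Now score this transcript on the same item.\n"
    "{item}\nTranscript:\n{transcript}\nScore:"
)

# Multi-step: decompose scoring into sequential model calls, e.g.
# first extract relevant evidence, then map it to a rubric anchor.
MULTI_STEP = [
    "List every utterance relevant to: {item}\nTranscript:\n{transcript}",
    "Given this evidence, which 1-5 rubric anchor fits best?\n{evidence}",
]

def build_prompt(template, **fields):
    """Fill a prompt template with the transcript and item under test."""
    return template.format(**fields)
```

In a benchmarking loop, each template would be filled per transcript and per MIRS item, with the model's numeric score parsed from the completion.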

📝 Abstract
Introduction. Objective Structured Clinical Examinations (OSCEs) are widely used to assess medical students' communication skills, but scoring interview-based assessments is time-consuming and potentially subject to human bias. This study explored the potential of large language models (LLMs) to automate OSCE evaluations using the Master Interview Rating Scale (MIRS). Methods. We compared the performance of four state-of-the-art LLMs (GPT-4o, Claude 3.5, Llama 3.1, and Gemini 1.5 Pro) in evaluating OSCE transcripts across all 28 items of the MIRS under zero-shot, chain-of-thought (CoT), few-shot, and multi-step prompting conditions. The models were benchmarked against a dataset of 10 OSCE cases with 174 expert consensus scores. Model performance was measured using three accuracy metrics (exact, off-by-one, thresholded). Results. Averaging across all MIRS items and OSCE cases, LLMs performed with low exact accuracy (0.27 to 0.44) and moderate to high off-by-one accuracy (0.67 to 0.87) and thresholded accuracy (0.75 to 0.88). A zero temperature parameter ensured high intra-rater reliability (α = 0.98 for GPT-4o). CoT, few-shot, and multi-step techniques proved valuable when tailored to specific assessment items. Performance was consistent across MIRS items, independent of encounter phase and communication domain. Conclusion. We demonstrated the feasibility of AI-assisted OSCE evaluation and benchmarked multiple LLMs across multiple prompting techniques. Our work provides a baseline performance assessment for LLMs that lays a foundation for future research in automated assessment of clinical communication skills.
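The three accuracy metrics and the reliability statistic above lend themselves to a short illustration. The following is a minimal sketch, not the study's code: it assumes MIRS items are rated on an integer 1-5 scale and uses a hypothetical cutoff of 3 for the thresholded metric.

```python
from statistics import pvariance

def exact_accuracy(pred, true):
    """Fraction of items where the model score equals the expert score."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def off_by_one_accuracy(pred, true):
    """Fraction of items where the model score is within 1 point of the expert score."""
    return sum(abs(p - t) <= 1 for p, t in zip(pred, true)) / len(true)

def thresholded_accuracy(pred, true, cutoff=3):
    """Fraction of items where model and expert fall on the same side of a cutoff."""
    return sum((p >= cutoff) == (t >= cutoff) for p, t in zip(pred, true)) / len(true)

def cronbach_alpha(runs):
    """Cronbach's alpha over repeated scoring runs (rows) by items (columns)."""
    k = len(runs[0])
    item_vars = [pvariance([run[i] for run in runs]) for i in range(k)]
    total_var = pvariance([sum(run) for run in runs])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

model_scores = [3, 4, 2, 5, 4, 1]   # hypothetical LLM ratings
expert_scores = [3, 5, 2, 4, 2, 1]  # hypothetical expert consensus
print(exact_accuracy(model_scores, expert_scores))        # 0.5
print(off_by_one_accuracy(model_scores, expert_scores))   # ~0.83
print(thresholded_accuracy(model_scores, expert_scores))  # ~0.83
```

With repeated zero-temperature runs stacked as rows, `cronbach_alpha` yields an intra-rater consistency figure in the spirit of the α = 0.98 reported above, though the study's exact computation may differ.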
Problem

Research questions and friction points this paper is trying to address.

Artificial Intelligence
Medical Education Assessment
Communication Skills Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Automated Scoring in OSCE
Clinical Communication Skills Assessment
Jadon Geathers
Yann Hicke
Colleen Chan
Niroop Rajashekar
Yale School of Medicine
Justin Sewell
Susannah Cornes
Rene F. Kizilcec
Associate Professor, Cornell University
Education, Artificial Intelligence, Teaching and Learning, HCI
Dennis Shung