Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models

📅 2025-10-11

🤖 AI Summary

Problem: Existing clinical evaluations of LLMs rely predominantly on single-turn knowledge recall, so they fail to capture the hypothesis-driven, iterative nature of real-world clinical reasoning, in which information arrives gradually. Method: We introduce VivaBench, a multi-turn clinical reasoning benchmark that simulates oral (viva voce) examinations and is built from 1,762 physician-curated cases. Models must actively query for findings, select diagnostic tests, and iteratively refine their diagnoses under incomplete information. Contribution/Results: VivaBench systematically assesses LLMs' sequential reasoning under uncertainty and their susceptibility to errors that mirror clinicians' cognitive biases. Experiments reveal deficiencies in current LLMs, including hypothesis fixation, inappropriate test ordering, premature closure, and failure to screen for critical conditions; performance degrades significantly compared with diagnosis from fully described presentations. These findings expose a gap between LLM behavior and human clinical reasoning and provide a standardized benchmark for evaluating conversational medical AI.

📝 Abstract

Clinical reasoning in medicine is a hypothesis-driven process where physicians refine diagnoses from limited information through targeted history, physical examination, and diagnostic investigations. In contrast, current medical benchmarks for large language models (LLMs) primarily assess knowledge recall through single-turn questions, where complete clinical information is provided upfront. To address this gap, we introduce VivaBench, a multi-turn benchmark that evaluates sequential clinical reasoning in LLM agents. Our dataset consists of 1,762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training, requiring agents to actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis. While current LLMs demonstrate competence in diagnosing conditions from well-described clinical presentations, their performance degrades significantly when required to navigate iterative diagnostic reasoning under uncertainty in our evaluation. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice, including: (1) fixation on initial hypotheses, (2) inappropriate investigation ordering, (3) premature diagnostic closure, and (4) failure to screen for critical conditions. These patterns reveal fundamental limitations in how current LLMs reason and make decisions under uncertainty. Through VivaBench, we provide a standardized benchmark for evaluating conversational medical AI systems for real-world clinical decision support. Beyond medical applications, we contribute to the larger corpus of research on agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex decision-making environments.
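The interactive setup the abstract describes, where findings stay hidden until the agent asks for them, can be pictured with a small loop. The sketch below is an illustration only and is not the VivaBench code or API: the `Vignette` and `Transcript` classes, the `run_case` function, the request keys, and the toy case are all hypothetical names and data invented for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Vignette:
    """Hypothetical case format: findings stay hidden until requested."""
    presentation: str
    findings: dict  # request key -> result text
    diagnosis: str


@dataclass
class Transcript:
    requests: list = field(default_factory=list)
    revealed: dict = field(default_factory=dict)
    final_diagnosis: Optional[str] = None


def run_case(agent, case, max_turns=10):
    """Drive one examiner/agent dialogue.

    `agent` is any callable taking (presentation, revealed findings) and
    returning ("request", key) or ("diagnose", label).
    """
    log = Transcript()
    for _ in range(max_turns):
        kind, value = agent(case.presentation, dict(log.revealed))
        if kind == "diagnose":
            log.final_diagnosis = value
            break
        # Reveal only what was asked for; unknown requests yield no data.
        log.requests.append(value)
        log.revealed[value] = case.findings.get(value, "not available")
    return log


def scripted_agent(presentation, revealed):
    """Stand-in for an LLM: asks two fixed questions, then commits."""
    for key in ("history:pain_radiation", "ix:ecg"):
        if key not in revealed:
            return ("request", key)
    return ("diagnose", "myocardial infarction")


# Toy case with made-up content, for illustration only.
case = Vignette(
    presentation="58-year-old with chest pain",
    findings={
        "history:pain_radiation": "radiates to left arm",
        "ix:ecg": "ST elevation in II, III, aVF",
    },
    diagnosis="myocardial infarction",
)
log = run_case(scripted_agent, case)
print(log.requests, log.final_diagnosis == case.diagnosis)
```

In a real harness the scripted agent would be replaced by a model call, and the transcript would be scored on more than the final label; how VivaBench itself does this is not specified in this summary.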

Problem

Research questions and friction points this paper is trying to address.

Evaluating clinical reasoning in LLMs through simulated oral examinations

Assessing sequential diagnostic reasoning under information uncertainty

Identifying failure modes in LLM decision-making mirroring clinical errors
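To make the failure modes named in the abstract concrete, here is a rough sketch of how they might be flagged from a recorded dialogue. These heuristics are assumptions made for illustration and are not the paper's metrics: the `flag_failure_modes` function, the `ix:` prefix for investigations, the thresholds, and the example turns are all hypothetical.

```python
def flag_failure_modes(turns, must_screen, min_requests=2):
    """Flag crude proxies for four failure modes.

    turns: list of (request_key, leading_hypothesis) tuples in dialogue order.
    must_screen: request keys that should appear to rule out critical conditions.
    """
    requests = [request for request, _ in turns]
    hypotheses = [hypothesis for _, hypothesis in turns]
    # Index of the first investigation ("ix:" prefix is an assumed convention).
    first_ix = next(
        (i for i, request in enumerate(requests) if request.startswith("ix:")),
        None,
    )
    return {
        # Leading hypothesis never changed over three or more turns.
        "fixation": len(turns) >= 3 and len(set(hypotheses)) == 1,
        # An investigation was ordered before any history or examination.
        "investigation_before_history": first_ix == 0,
        # The agent committed after very few requests.
        "premature_closure": len(requests) < min_requests,
        # At least one required screening request was never made.
        "missed_critical_screen": not set(must_screen) <= set(requests),
    }


# Made-up trajectory, for illustration only.
turns = [
    ("ix:ct_head", "migraine"),
    ("history:onset", "migraine"),
    ("exam:neck_stiffness", "migraine"),
]
flags = flag_failure_modes(turns, must_screen={"ix:lumbar_puncture"})
print(flags)
```

Proxies like these are deliberately blunt: an unchanged hypothesis may simply be correct, so a real analysis would need to compare against the reference diagnosis and the findings revealed at each turn.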

Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulates oral exams for clinical reasoning evaluation

Uses multi-turn benchmark with physician-curated vignettes

Tests iterative diagnostic reasoning under uncertainty
