🤖 AI Summary
This study addresses the susceptibility of large language models (LLMs) to physician input during clinical decision-making, which can either enhance diagnostic accuracy or introduce harmful biases, and highlights the inadequacy of current frameworks for evaluating human-AI collaboration. The work proposes the first interactive assessment framework that integrates medical case studies with real-world physician-AI dialogue data to systematically analyze the diagnostic behavior of 21 reasoning LLM variants under both expert and adversarial physician contexts. Leveraging multi-turn dialogue simulations, differential diagnosis consistency, WHO harm severity grading, and inference-time scaling, the study reveals a spectrum of model phenotypes ranging from compliant to obstinate. Results show that expert context increases the inclusion rate of correct diagnoses by an average of 20.4 percentage points, whereas adversarial context significantly degrades performance in 14 of the 21 models. Inference-time scaling effectively reduces harmful outputs across all severity levels, and explicit uncertainty prompts improve correct final diagnosis inclusion by 15 percentage points in adversarial scenarios.
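To make the differential diagnosis consistency measure concrete, below is a minimal sketch of the ≥3-shared-items concordance statistic described here and in the abstract. The helper names (`normalize`, `shared_items`, `concordance_rate`) and the plain string matching are illustrative assumptions; the paper's actual normalization of diagnosis terms is not specified here.

```python
from typing import List, Set, Tuple

def normalize(diagnosis: str) -> str:
    """Case/whitespace-insensitive matching; a real pipeline would likely
    map synonyms onto a shared ontology rather than compare raw strings."""
    return " ".join(diagnosis.lower().split())

def shared_items(ddx_a: List[str], ddx_b: List[str]) -> Set[str]:
    """Diagnoses appearing in both differential diagnosis (DDx) lists."""
    return {normalize(d) for d in ddx_a} & {normalize(d) for d in ddx_b}

def concordance_rate(pairs: List[Tuple[List[str], List[str]]],
                     threshold: int = 3) -> float:
    """Fraction of simulation pairs sharing at least `threshold` DDx items."""
    hits = sum(1 for a, b in pairs if len(shared_items(a, b)) >= threshold)
    return hits / len(pairs)

# Toy example: three overlapping diagnoses meet the >=3 concordance bar.
model_ddx = ["Sarcoidosis", "Tuberculosis", "Lymphoma", "Histoplasmosis"]
clinician_ddx = ["lymphoma", "tuberculosis", "sarcoidosis", "IgG4-related disease"]
print(len(shared_items(model_ddx, clinician_ddx)) >= 3)  # True
```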
📝 Abstract
Large language models (LLMs) are entering clinician workflows, yet evaluations rarely measure how clinician reasoning shapes model behavior during clinical interactions. We combined 61 New England Journal of Medicine Case Records with 92 real-world clinician-AI interactions to evaluate 21 reasoning LLM variants across 8 frontier models on differential diagnosis generation and next-step recommendations under three conditions: reasoning alone, after expert clinician context, and after adversarial clinician context. LLM-clinician concordance increased substantially after clinician exposure, with the share of simulations having ≥3 overlapping differential diagnosis items rising from 65.8% to 93.5% and ≥3 overlapping next-step recommendations from 20.3% to 53.8%. Expert context significantly improved correct final diagnosis inclusion across all 21 models (mean +20.4 percentage points), reflecting both genuine reasoning improvement and passive content echoing, while adversarial context caused significant diagnostic degradation in 14 models (mean -5.4 percentage points). Multi-turn disagreement probes revealed distinct model phenotypes ranging from highly conformist to dogmatic, with adversarial arguments remaining a persistent vulnerability even for otherwise resilient models. Inference-time scaling reduced harmful echoing of clinician-introduced recommendations across WHO-defined harm severity tiers (relative reductions: 62.7% mild, 57.9% moderate, 76.3% severe, 83.5% death-tier). In GPT-4o experiments, explicit clinician uncertainty signals improved diagnostic performance after adversarial context (final diagnosis inclusion rising from 27% to 42%) and reduced alignment with incorrect arguments by 21%. These findings establish a foundation for evaluating clinician-AI collaboration and introduce interactive metrics and mitigation strategies essential for safety and robustness.
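The relative reductions quoted for the WHO harm tiers follow the standard formula (baseline - treated) / baseline. The sketch below is only an illustration of that arithmetic: the absolute echo rates are hypothetical placeholders chosen to reproduce the reported 62.7% mild-tier figure, since the abstract reports relative reductions rather than raw rates.

```python
def relative_reduction(baseline: float, treated: float) -> float:
    """Relative drop in harmful-echo rate: (baseline - treated) / baseline."""
    return (baseline - treated) / baseline

# Hypothetical rates; only the resulting ratio matches the reported figure.
print(f"{relative_reduction(0.300, 0.112):.1%}")  # -> 62.7%
```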