AI Summary
Existing clinical LLM evaluations struggle to distinguish genuine medical reasoning from superficial pattern matching. To address this, we propose a counterfactual behavioral testing framework grounded in real-world ICU discharge summaries, systematically perturbing demographic and vital-sign variables to assess model sensitivity to subtle clinical changes and consistency of reasoning. Our contributions are: (1) the first fine-grained behavioral testing paradigm explicitly designed for clinical reasoning; (2) a dual-version counterfactual dataset combining real-data-driven curation with synthetic augmentation; and (3) empirical identification of an inherent trade-off between fairness and responsiveness in fine-tuned models. Experiments on MIMIC-IV evaluate input-level sensitivity and downstream tasks such as hospital length-of-stay prediction, comparing zero-shot and fine-tuned LLMs. Results show that zero-shot models exhibit more coherent counterfactual reasoning, whereas fine-tuned models are more stable but less responsive to clinically meaningful changes; persistent demographic biases underscore the need for rigorous fairness evaluation in clinical LLMs.
Abstract
Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial pattern matching. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital-sign attributes. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants under both zero-shot and fine-tuned settings. We assess model behavior via (1) input-level sensitivity - how counterfactuals alter the likelihood of a note; and (2) downstream reasoning - how they affect predicted hospital length-of-stay. Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems.
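The core evaluation loop described above - edit exactly one attribute in a note, then compare model behavior on the original and counterfactual versions - can be sketched as follows. This is a minimal illustration, not the paper's implementation; the helper names (`perturb_age`, `sensitivity`) and the regex-based age edit are assumptions chosen for clarity, and in practice the note log-likelihoods would come from scoring each note with the LLM under test.

```python
import re

def perturb_age(note: str, new_age: int) -> str:
    """Single-variable counterfactual: replace the first 'NN-year-old'
    mention with a new age, leaving the rest of the note untouched."""
    return re.sub(r"\b\d{1,3}(?=-year-old\b)", str(new_age), note, count=1)

def sensitivity(logp_original: float, logp_counterfactual: float) -> float:
    """Input-level sensitivity: change in the note's log-likelihood
    under the counterfactual edit (scores supplied by the scored LLM)."""
    return logp_counterfactual - logp_original

# Toy example of the perturbation step.
note = "Patient is a 67-year-old male admitted to the ICU with sepsis."
cf_note = perturb_age(note, 37)
# cf_note: "Patient is a 37-year-old male admitted to the ICU with sepsis."
```

A full run would apply analogous single-variable edits for gender, ethnicity, and each vital sign, then aggregate the sensitivity scores and the shift in predicted length-of-stay across the dataset.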