Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence

📅 2026-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study examines how large language models (LLMs) balance faithfulness to counterfactual medical contexts against the imperative of generating safe outputs in high-stakes clinical scenarios. To investigate this systematically, the authors construct MedCounterFact, a counterfactual medical QA dataset built by substituting real medical interventions within randomized controlled trial narratives with four categories of counterfactual stimuli, such as nonsensical terms or toxic substances. Evaluating multiple frontier LLMs on this dataset, the experiments show that prevailing models consistently and confidently adopt hazardous or implausible counterfactual premises without issuing appropriate safety warnings. This behavior exposes a critical gap in current safety alignment: when fed manipulated evidence, models produce assertive yet misleading medical advice.

📝 Abstract
In high-stakes domains like medicine, it may be generally desirable for models to faithfully adhere to the context provided. But what happens if the context does not align with model priors or safety protocols? In this paper, we investigate how LLMs behave and reason when presented with counterfactual or even adversarial medical evidence. We first construct MedCounterFact, a counterfactual medical QA dataset that requires the models to answer clinical comparison questions (i.e., judge the efficacy of certain treatments, with evidence consisting of randomized controlled trials provided as context). In MedCounterFact, real-world medical interventions within the questions and evidence are systematically replaced with four types of counterfactual stimuli, ranging from unknown words to toxic substances. Our evaluation across multiple frontier LLMs on MedCounterFact reveals that in the presence of counterfactual evidence, existing models overwhelmingly accept such "evidence" at face value even when it is dangerous or implausible, and provide confident and uncaveated answers. While it may be prudent to draw a boundary between faithfulness and safety, our findings reveal that there exists no such boundary yet.
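
The substitution procedure the abstract describes can be illustrated with a minimal sketch. Everything below is hypothetical: the helper name `substitute_intervention`, the category keys, and the example term lists are illustrative stand-ins, since the paper names only nonsense terms and toxic substances among its four stimulus categories and does not publish its term lists here.

```python
# Hypothetical sketch of a MedCounterFact-style substitution step:
# replace a real intervention in an RCT narrative with a counterfactual
# stimulus. Category names and example terms are assumptions, not the
# authors' actual lists.
import random
import re

COUNTERFACTUAL_STIMULI = {
    "nonsense_term": ["flurbitol", "zanthraxine"],  # made-up drug names
    "toxic_substance": ["mercury", "cyanide"],      # dangerous substances
}

def substitute_intervention(rct_text: str, intervention: str,
                            category: str, rng: random.Random) -> str:
    """Replace every mention of a real intervention in an RCT narrative
    with a counterfactual stimulus drawn from the chosen category."""
    stimulus = rng.choice(COUNTERFACTUAL_STIMULI[category])
    return re.sub(re.escape(intervention), stimulus, rct_text,
                  flags=re.IGNORECASE)

rng = random.Random(0)
evidence = ("In a randomized controlled trial, patients receiving "
            "metformin showed greater HbA1c reduction than placebo.")
print(substitute_intervention(evidence, "metformin", "toxic_substance", rng))
```

Under this kind of perturbation, the paper's evaluation asks whether a model answers the efficacy question as if the substituted "evidence" were real, or flags it as implausible or unsafe.
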
Problem

Research questions and friction points this paper is trying to address.

faithfulness
safety
counterfactual evidence
large language models
medical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

counterfactual reasoning
medical safety
large language models
faithfulness vs. safety
adversarial evidence