🤖 AI Summary
This study addresses the sensitivity of large language models (LLMs) to question phrasing in medical question answering, where identical clinical evidence can elicit contradictory responses. For the first time, it systematically evaluates LLM consistency under retrieval-augmented generation (RAG) when models are confronted with positively versus negatively framed queries and varying linguistic styles. Using 6,614 expert-constructed query pairs derived from clinical trial abstracts, experiments across eight mainstream LLMs reveal that positive–negative framing significantly increases the likelihood of contradictory answers—an effect exacerbated in multi-turn dialogues—whereas changes in linguistic style show no significant impact. These findings highlight the risk of framing-induced inconsistencies and underscore the urgent need to enhance LLM robustness in high-stakes clinical settings.
📝 Abstract
Patients are increasingly turning to large language models (LLMs) with medical questions that are complex and difficult to articulate clearly. However, LLMs are sensitive to prompt phrasing: the way a question is worded can influence the answer. Ideally, LLMs should respond consistently regardless of phrasing, particularly when grounded in the same underlying evidence. We investigate this through a systematic evaluation in a controlled retrieval-augmented generation (RAG) setting for medical question answering (QA), where expert-selected documents are used rather than automatically retrieved ones. We examine two dimensions of patient query variation: question framing (positive vs. negative) and language style (technical vs. plain language). We construct a dataset of 6,614 query pairs grounded in clinical trial abstracts and evaluate response consistency across eight LLMs. Our findings show that positively and negatively framed pairs are significantly more likely to produce contradictory conclusions than same-framing pairs. This framing effect is further amplified in multi-turn conversations, where sustained persuasion increases inconsistency. We find no significant interaction between framing and language style. Our results demonstrate that LLM responses in medical QA can be systematically influenced through query phrasing alone, even when grounded in the same evidence, highlighting the importance of phrasing robustness as an evaluation criterion for RAG-based systems in high-stakes settings.