Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions

📅 2025-04-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) are clinically unreliable when answering real-world cancer patient questions that contain false presuppositions, a failure mode that existing medical evaluation benchmarks do not measure. Method: To address the lack of adversarial medical contexts in current assessments, the authors introduce Cancer-Myth, the first expert-validated adversarial dataset (585 items) for evaluating presupposition identification and correction, and benchmark state-of-the-art models including GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet. Contribution/Results: No model corrected false presuppositions in more than 30% of cases; notably, GPT-4-Turbo achieved high overall response quality (4.13/5) yet failed almost entirely at recognizing presuppositions. Multi-model comparison, validated through oncology expert review and medical agent analysis, reveals systematic neglect or reinforcement of false medical premises. This work quantifies, for the first time, this "presupposition blindness" in clinical LLM dialogue, establishing a benchmark and methodology for evaluating the safety and trustworthiness of AI-powered healthcare conversational systems.

📝 Abstract
Cancer patients are increasingly turning to large language models (LLMs) as a new form of internet search for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with detailed clinical contexts. In this paper, we first evaluate LLMs on cancer-related questions drawn from real patients, reviewed by three hematology oncology physicians. While responses are generally accurate, with GPT-4-Turbo scoring 4.13 out of 5, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM -- including GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet -- corrects these false presuppositions more than 30% of the time. Even advanced medical agentic methods do not prevent LLMs from ignoring false presuppositions. These findings expose a critical gap in the clinical reliability of LLMs and underscore the need for more robust safeguards in medical AI systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI chatbots on cancer patient questions with false assumptions
Assessing LLM reliability in handling clinical misinformation risks
Identifying gaps in AI correction of medical presuppositions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating LLMs on real patient cancer questions
Introducing expert-verified adversarial dataset Cancer-Myth
Testing frontier LLMs on false presupposition correction
Wang Bill Zhu
University of Southern California
natural language processing, vision-and-language, machine learning
Tianqi Chen
Thomas Lord Department of Computer Science, USC
Ching Ying Lin
Keck School of Medicine, USC
Jade Law
Keck School of Medicine, USC
Mazen Jizzini
Keck School of Medicine, USC
Jorge J. Nieva
Keck School of Medicine, USC
Ruishan Liu
University of Southern California
machine learning, computational health, computational biology
Robin Jia
University of Southern California
natural language processing