🤖 AI Summary
This study addresses a critical gap in the evaluation of large language models (LLMs) for medical question answering: current benchmarks rely heavily on standardized medical exam questions, which fail to capture the erroneous assumptions and latent risks common in real-world patient queries. To bridge this gap, the authors construct the first systematic dataset of authentic patient questions, derived from Google's "People Also Ask" feature for the top 200 prescribed medications in the U.S. They show that incorrect premises are not randomly distributed: their occurrence correlates significantly with the degree of incorrectness in the chain of questions that preceded them. Through web crawling, manual annotation, and quantitative analysis, they evaluate leading LLMs and find that, despite strong performance on conventional medical benchmarks, these models struggle to identify and appropriately respond to flawed premises in realistic scenarios, revealing a substantial disconnect between existing evaluation frameworks and practical clinical needs.
📝 Abstract
Patients are increasingly using large language models (LLMs) to seek answers to their healthcare-related questions. However, benchmarks for LLM question answering often focus on medical exam questions, which differ significantly in style and content from the questions patients raise in real life. To bridge this gap, we sourced data from Google's People Also Ask feature by querying the top 200 prescribed medications in the United States, curating a dataset of medical questions people commonly ask. A considerable portion of the collected questions contains incorrect assumptions or dangerous intentions. We demonstrate that these corrupted questions do not appear uniformly at random: their emergence depends heavily on the degree of incorrectness in the history of questions that led to them. Current LLMs that perform strongly on other benchmarks struggle to identify incorrect assumptions in these everyday questions.