🤖 AI Summary
This study addresses a gap in safety evaluation for large language models (LLMs): whether they give comparable medical information when different users phrase the same health question differently. The authors introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark of 4,320 prompts built from 60 medically reviewed, low-risk health questions, which varies user-side language, register, and health literacy signals. Across five mainstream LLMs, responses to low health-literacy prompts consistently omitted more key information, gave fewer concrete next steps, and offered less support for independent judgment, a pattern the authors term "Differential Information Dilution" (DID). Language effects were model-specific rather than uniformly worse for non-English prompts. A knowledge-guided mitigation prompt reduces this dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).
📝 Abstract
Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).
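To make the idea of Differential Information Dilution concrete, the following is a minimal illustrative sketch, not the authors' scoring method. It assumes each question comes with a checklist of key points, and it measures the gap in checklist coverage between responses to high and low health-literacy prompts. The function names (`coverage`, `dilution_gap`), the record fields, the substring matching rule, and the toy data are all hypothetical; the abstract does not specify how omissions are scored.

```python
from statistics import mean


def coverage(response: str, key_points: list[str]) -> float:
    """Fraction of key points found verbatim (case-insensitive) in the response.

    Substring matching is a crude stand-in chosen for illustration only.
    """
    text = response.lower()
    hits = sum(1 for point in key_points if point.lower() in text)
    return hits / len(key_points)


def dilution_gap(records: list[dict]) -> float:
    """Mean coverage for high-literacy prompts minus mean coverage for low-literacy prompts.

    A positive value means low-literacy prompts received less key information.
    """
    by_level: dict[str, list[float]] = {"high": [], "low": []}
    for record in records:
        score = coverage(record["response"], record["key_points"])
        by_level[record["literacy"]].append(score)
    return mean(by_level["high"]) - mean(by_level["low"])


# Toy data (hypothetical): one question, answered under two literacy signals.
KEY_POINTS = [
    "drink fluids",
    "rest",
    "see a doctor if the fever lasts more than three days",
]

records = [
    {
        "literacy": "high",
        "key_points": KEY_POINTS,
        "response": (
            "Rest, drink fluids, and see a doctor if the fever "
            "lasts more than three days."
        ),
    },
    {
        "literacy": "low",
        "key_points": KEY_POINTS,
        "response": "Just rest and you will be fine.",
    },
]

gap = dilution_gap(records)
print(f"coverage gap (high - low): {gap:.3f}")
```

In this toy case the high-literacy response covers all three key points and the low-literacy response covers one, so the gap is 2/3. A real audit would need a more robust matching step than verbatim substrings, for example expert annotation or a rubric-based grader.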