🤖 AI Summary
Current health AI evaluation benchmarks lack standardized descriptions of the user queries they contain, limiting how accurately they reflect model readiness for real-world clinical settings. This study systematically documents this "validity gap" and proposes adapting clinical trial reporting standards to create structured query profiles. Using large language models as automated annotators, we coded 18,707 health-related queries from six public benchmarks against a 16-dimensional taxonomy capturing clinical context, topic, and intent. Our analysis reveals significant structural biases: existing benchmarks severely underrepresent complex diagnostic inputs such as laboratory tests, imaging, and raw clinical notes; safety-critical scenarios (e.g., self-harm) constitute less than 0.7% of queries; and coverage of pediatric, geriatric, and chronic disease populations is markedly insufficient. These findings highlight a substantial misalignment between current evaluation frameworks and actual clinical needs.
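To make the annotation approach concrete, here is a minimal Python sketch of LLM-as-coder query profiling, assuming an OpenAI-style chat API. The model name, prompt wording, and the three taxonomy fields shown are illustrative placeholders, not the study's actual 16-field schema.

```python
# Minimal sketch of LLM-based query annotation (illustrative only).
# The fields below are hypothetical stand-ins for the paper's 16-field
# taxonomy; the exact schema, prompt, and model choice are assumptions.
import json
from openai import OpenAI

# Three illustrative dimensions out of a larger taxonomy.
TAXONOMY_FIELDS = {
    "clinical_context": ["wellness", "acute_symptom", "chronic_management", "emergency"],
    "objective_data": ["none", "wearable_signal", "lab_values", "imaging", "medical_record"],
    "intent": ["information_seeking", "triage", "interpretation", "treatment_advice"],
}

def annotate_query(client: OpenAI, query: str) -> dict:
    """Ask the model to code one health query against the taxonomy."""
    prompt = (
        "Classify the consumer health query below. "
        "Return JSON with exactly these keys and one allowed value each:\n"
        f"{json.dumps(TAXONOMY_FIELDS, indent=2)}\n\n"
        f"Query: {query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable output
        temperature=0,  # deterministic coding, as an automated instrument
    )
    return json.loads(resp.choices[0].message.content)

if __name__ == "__main__":
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    print(annotate_query(
        client,
        "My smartwatch says my resting heart rate jumped 15 bpm this week. Should I worry?",
    ))
```

Run over a full corpus, the resulting labels are what make corpus-level composition statistics (like those reported below) computable in the first place.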
📝 Abstract
Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs) rarely characterize the "patient" or "query" populations they contain. Without defined composition, aggregate performance metrics may misrepresent model readiness for clinical use.
Methods: We analyzed 18,707 consumer health queries across six public benchmarks using LLMs as automated coding instruments to apply a standardized 16-field taxonomy profiling context, topic, and intent.
Results: We identified a structural "validity gap." While benchmarks have evolved from static retrieval to interactive dialogue, clinical composition remains misaligned with real-world needs. Although 42% of the corpus referenced objective data, this was polarized toward wellness-focused wearable signals (17.7%); complex diagnostic inputs remained rare, including laboratory values (5.2%), imaging (3.8%), and raw medical records (0.6%). Safety-critical scenarios were effectively absent (suicide/self-harm queries comprised <0.7% of the corpus), and chronic disease management accounted for only 5.5%. Benchmarks also neglected vulnerable populations (pediatrics/older adults <11%) and global health needs.
Conclusions: Evaluation benchmarks remain misaligned with real-world clinical needs, lacking raw clinical artifacts, adequate representation of vulnerable populations, and longitudinal chronic care scenarios. The field must adopt standardized query profiling, analogous to clinical trial reporting, to align evaluation with the full complexity of clinical practice.
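As an illustration of what standardized query profiling could look like in practice, the sketch below aggregates per-dimension coverage percentages over annotated queries, loosely mirroring a clinical trial's baseline-characteristics table. All field names and example values are assumptions for illustration, not the paper's schema.

```python
# Sketch of benchmark-level "query profiling," analogous in spirit to a
# clinical trial's baseline-characteristics reporting. Field names are
# hypothetical; the paper's actual 16-field taxonomy may differ.
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryProfile:
    clinical_context: str   # e.g. "chronic_management"
    objective_data: str     # e.g. "lab_values"
    population: str         # e.g. "pediatric", "geriatric", "adult"

def composition_report(profiles: list[QueryProfile]) -> dict[str, dict[str, float]]:
    """Per-dimension coverage percentages for a benchmark's query corpus."""
    n = len(profiles)
    report: dict[str, dict[str, float]] = {}
    for dim in ("clinical_context", "objective_data", "population"):
        counts = Counter(getattr(p, dim) for p in profiles)
        report[dim] = {k: round(100 * v / n, 1) for k, v in counts.items()}
    return report

if __name__ == "__main__":
    corpus = [
        QueryProfile("wellness", "wearable_signal", "adult"),
        QueryProfile("acute_symptom", "none", "adult"),
        QueryProfile("chronic_management", "lab_values", "geriatric"),
    ]
    # Low shares for lab_values/imaging or pediatric/geriatric populations
    # would flag the validity gap described above.
    print(composition_report(corpus))
```

Publishing such a composition table alongside aggregate accuracy scores would let readers judge which patient and query populations a benchmark's results actually generalize to.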