🤖 AI Summary
Existing health AI benchmarks lack personalized evaluation frameworks tailored to the daily decision-making of people with diabetes, and so fail to reflect how well large language models (LLMs) can support them in practice. To address this gap, we introduce DexBench, the first LLM benchmark dedicated to diabetes self-management. It is built from one month of real-world longitudinal physiological and behavioral data from 15,000 individuals and comprises 360,600 personalized question-answer instances spanning seven task categories, including glycemic interpretation, behavior–glucose association, and long-term planning. We propose a multidimensional evaluation framework assessing accuracy, groundedness, safety, clarity, and actionability, and systematically benchmark eight state-of-the-art LLMs. Results reveal pronounced performance imbalances: no single model dominates across all tasks and metrics. DexBench fills a critical void in patient-facing health AI evaluation, providing both a rigorous assessment tool and actionable insights for improving the reliability and practical utility of LLMs in metabolic health management.
📝 Abstract
We present DexBench, the first benchmark designed to evaluate large language model (LLM) performance on the real-world decision-making tasks faced by individuals managing diabetes in their daily lives. Unlike prior health benchmarks, which are either generic, clinician-facing, or focused on clinical tasks (e.g., diagnosis, triage), DexBench introduces a comprehensive evaluation framework tailored to the unique challenges of prototyping patient-facing AI solutions for diabetes, glucose management, metabolic health, and related domains. The benchmark spans 7 distinct task categories that reflect the breadth of real-world questions individuals with diabetes ask, including basic glucose interpretation, educational queries, behavioral associations, advanced decision-making, and long-term planning. To this end, we compile a rich dataset comprising one month of time-series data, including glucose traces and metrics from continuous glucose monitors (CGMs) and behavioral logs (e.g., eating and activity patterns), from 15,000 individuals across three diabetes populations (type 1, type 2, and pre-diabetes/general health and wellness). From this data, we generate a total of 360,600 personalized, contextual questions across the 7 tasks. We evaluate model performance on these tasks along 5 metrics: accuracy, groundedness, safety, clarity, and actionability. Our analysis of 8 recent LLMs reveals substantial variability across tasks and metrics; no single model consistently outperforms the others on all dimensions. By establishing this benchmark, we aim to advance the reliability, safety, effectiveness, and practical utility of AI solutions in diabetes care.
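To make the multi-metric evaluation concrete, the minimal sketch below aggregates per-response scores into a per-task, per-metric table. The five metric names follow the abstract; everything else (the `ScoredResponse` schema, the task label, the 0–1 score scale, and the sample values) is an illustrative assumption, not DexBench's actual implementation.

```python
from dataclasses import dataclass
from statistics import mean

# Metric names taken from the abstract; all other names/values are hypothetical.
METRICS = ("accuracy", "groundedness", "safety", "clarity", "actionability")

@dataclass
class ScoredResponse:
    task: str      # one of the 7 task categories (label here is illustrative)
    scores: dict   # metric name -> score in [0, 1] (assumed scale)

def aggregate_by_task(results):
    """Mean score per (task, metric) pair across all scored responses."""
    by_task = {}
    for r in results:
        by_task.setdefault(r.task, []).append(r.scores)
    return {
        task: {m: mean(row[m] for row in rows) for m in METRICS}
        for task, rows in by_task.items()
    }

# Two hypothetical scored responses for one task category.
results = [
    ScoredResponse("glucose_interpretation",
                   dict(zip(METRICS, (1.0, 0.75, 1.0, 0.5, 0.5)))),
    ScoredResponse("glucose_interpretation",
                   dict(zip(METRICS, (0.5, 0.25, 1.0, 1.0, 0.75)))),
]
table = aggregate_by_task(results)
# table["glucose_interpretation"]["accuracy"] -> 0.75
```

Keeping the metrics as separate columns, rather than collapsing them into one score, is what lets a benchmark surface the kind of imbalance the abstract reports, e.g., a model that is accurate but weak on actionability.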