🤖 AI Summary
This work addresses a limitation of existing diabetes management platforms: they offer only static glucose summaries and do not support natural language queries over continuous glucose monitoring (CGM) data. Applying large language models (LLMs) directly to such data raises privacy and accuracy concerns. The authors propose CGM-Agent, a framework in which the LLM acts as a reasoning controller that selects local, deterministic analytical functions, so that sensitive data remains on the user's device. The study introduces a 4,180-question benchmark for CGM question answering; top models reach 94% value accuracy on synthetic queries and 88% on ambiguous real user queries. Lightweight models perform competitively within the agent design, suggesting opportunities for low-cost deployment. The code and benchmark are publicly released.
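The controller pattern described in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mean_glucose`, `time_in_range`), the JSON call format, and the sample readings are all assumptions made for illustration. The point it shows is that the model only returns a function name and arguments, while the computation runs locally on data the model never sees.

```python
import json
from statistics import mean

# Hypothetical local readings: (hour_of_day, glucose in mg/dL).
# In this sketch they stay in local memory and are never sent to a model.
READINGS = [(0, 110), (6, 95), (9, 160), (12, 185), (18, 140), (22, 120)]


def mean_glucose(start_hour: int, end_hour: int) -> float:
    """Average glucose for readings with start_hour <= hour < end_hour."""
    values = [g for h, g in READINGS if start_hour <= h < end_hour]
    return mean(values)


def time_in_range(low: int, high: int) -> float:
    """Fraction of readings with low <= glucose <= high."""
    in_range = [g for _, g in READINGS if low <= g <= high]
    return len(in_range) / len(READINGS)


# Registry of deterministic analytical functions the model may select from.
TOOLS = {"mean_glucose": mean_glucose, "time_in_range": time_in_range}


def execute_call(call_json: str):
    """Run a model-selected function locally and return its result."""
    call = json.loads(call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown function: {name}")
    return TOOLS[name](**call["arguments"])


# Stand-in for a model's reply to "What was my average glucose this morning?"
# (assumed format; a real deployment would obtain this from an LLM).
llm_reply = '{"name": "mean_glucose", "arguments": {"start_hour": 6, "end_hour": 12}}'
result = execute_call(llm_reply)  # 127.5 for the sample readings
```

Because the numeric answer comes from a deterministic function rather than from generated text, an error in this design would show up as a wrong function or wrong arguments, not as faulty arithmetic.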
📝 Abstract
Continuous glucose monitors (CGMs) used in diabetes care collect rich personal health data that could improve day-to-day self-management. However, current patient platforms only offer static summaries that do not support inquisitive user queries. Large language models (LLMs) could enable free-form inquiries about continuous glucose data, but deploying them over sensitive health records raises privacy and accuracy concerns. In this paper, we present CGM-Agent, a privacy-preserving framework for question answering over personal glucose data. In our design, the LLM serves purely as a reasoning engine that selects analytical functions. All computation occurs locally, and personal health data never leaves the user's device. For evaluation, we construct a benchmark of 4,180 questions combining parameterized question templates with real user queries and ground truth derived from deterministic program execution. Evaluating six leading LLMs, we find that top models achieve 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries. Errors stem primarily from intent and temporal ambiguity rather than computational failures. Additionally, lightweight models achieve competitive performance in our agent design, suggesting opportunities for low-cost deployment. We release our code and benchmark to support future work on trustworthy health agents.
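The benchmark construction mentioned above, parameterized templates paired with ground truth from deterministic program execution, can be sketched as follows. This is an illustrative assumption rather than the released benchmark code: the template wording, the `max_glucose` and `min_glucose` programs, and the synthetic readings are hypothetical.

```python
from itertools import product

# Hypothetical synthetic readings: (day_index, glucose in mg/dL).
READINGS = [(0, 210), (0, 80), (1, 200), (1, 90), (2, 130), (2, 170)]


def _recent(days: int):
    """Glucose values from the most recent `days` days of READINGS."""
    last_day = max(d for d, _ in READINGS)
    return [g for d, g in READINGS if d > last_day - days]


def max_glucose(days: int) -> int:
    """Highest reading within the most recent `days` days."""
    return max(_recent(days))


def min_glucose(days: int) -> int:
    """Lowest reading within the most recent `days` days."""
    return min(_recent(days))


# One parameterized template; each slot value maps to a deterministic program.
TEMPLATE = "What was my {stat} glucose over the last {days} days?"
PROGRAMS = {"highest": max_glucose, "lowest": min_glucose}


def build_items(day_options):
    """Instantiate the template and compute ground truth by running the program."""
    items = []
    for stat, days in product(PROGRAMS, day_options):
        program = PROGRAMS[stat]
        items.append({
            "question": TEMPLATE.format(stat=stat, days=days),
            "program": program.__name__,
            "arguments": {"days": days},
            "answer": program(days),  # ground truth from execution, not annotation
        })
    return items


items = build_items([2, 3])
```

Each generated item carries both the expected function call and the expected value, so a model's output can be checked against either one.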