AI Summary
This work addresses the absence of systematic benchmarks for evaluating large language models (LLMs) on long-term, cross-dimensional personal health assistant tasks. To this end, the authors introduce LifeAgentBench, a large-scale question-answering benchmark of 22,573 questions, along with LifeAgent, a strong baseline agent built on a multi-step evidence retrieval and deterministic aggregation architecture. Together these establish the first standardized evaluation protocol for long-horizon, cross-dimensional reasoning in digital health. Using an extensible benchmark construction pipeline and a unified evaluation protocol, the study systematically assesses 11 prominent LLMs, uncovering critical limitations in their long-term aggregation and cross-dimensional inference abilities. LifeAgent substantially outperforms existing baselines, demonstrating practical utility in real-world health scenarios.
Abstract
Personalized digital health support requires long-horizon, cross-dimensional reasoning over heterogeneous lifestyle signals, and recent advances in mobile sensing and large language models (LLMs) make such support increasingly feasible. However, the capabilities of current LLMs in this setting remain unclear due to the lack of systematic benchmarks. In this paper, we introduce LifeAgentBench, a large-scale QA benchmark for long-horizon, cross-dimensional, and multi-user lifestyle health reasoning, containing 22,573 questions that span basic retrieval to complex reasoning. We release an extensible benchmark construction pipeline and a standardized evaluation protocol to enable reliable and scalable assessment of LLM-based health assistants. We then systematically evaluate 11 leading LLMs on LifeAgentBench and identify key bottlenecks in long-horizon aggregation and cross-dimensional reasoning. Motivated by these findings, we propose LifeAgent, a strong baseline health-assistant agent that integrates multi-step evidence retrieval with deterministic aggregation, achieving significant improvements over two widely used baselines. Case studies further demonstrate its potential in realistic daily-life scenarios. The benchmark is publicly available at https://anonymous.4open.science/r/LifeAgentBench-CE7B.
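To make the retrieve-then-aggregate design concrete, the sketch below illustrates the general pattern in minimal Python: the agent first narrows the raw lifestyle log to the evidence a question needs (user, dimension, date window), then computes the numeric reduction deterministically in code rather than asking the LLM to do arithmetic over a long horizon. All names here (`HealthRecord`, `retrieve`, `aggregate`) are hypothetical illustrations, not the actual LifeAgent implementation.

```python
# Minimal sketch of a retrieve-then-aggregate agent step.
# Assumption: records are structured per-user, per-day, per-dimension entries;
# the real LifeAgent pipeline may differ.
from dataclasses import dataclass
from datetime import date
from statistics import mean

@dataclass
class HealthRecord:
    user_id: str
    day: date
    dimension: str   # e.g. "sleep_hours", "steps", "calories"
    value: float

def retrieve(records, user_id, dimension, start, end):
    """Step 1: evidence retrieval -- filter the log to the user,
    dimension, and date window the question asks about."""
    return [r for r in records
            if r.user_id == user_id and r.dimension == dimension
            and start <= r.day <= end]

def aggregate(evidence, op):
    """Step 2: deterministic aggregation -- the reduction runs in code,
    so long-horizon sums and averages cannot drift the way free-form
    LLM arithmetic can."""
    values = [r.value for r in evidence]
    return {"mean": mean, "sum": sum, "max": max, "min": min}[op](values)

# Toy usage: "What was u1's average nightly sleep over the first week of May?"
log = [HealthRecord("u1", date(2024, 5, d), "sleep_hours", 6.0 + d % 3)
       for d in range(1, 8)]
evidence = retrieve(log, "u1", "sleep_hours", date(2024, 5, 1), date(2024, 5, 7))
print(f"avg sleep: {aggregate(evidence, 'mean'):.2f} h")  # the LLM verbalizes this
```

In this division of labor the LLM is responsible only for deciding which evidence to fetch and how to phrase the final answer; every number in the response comes from the deterministic aggregation step.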