AI Summary
This work addresses the absence of systematic benchmarks for evaluating large language models (LLMs) on long-term, cross-dimensional personal health assistant tasks. To this end, the authors introduce LifeAgentBench, a large-scale question-answering benchmark of 22,573 questions, along with LifeAgent, a strong baseline agent built on a multi-step evidence retrieval and deterministic aggregation architecture. Together these establish the first standardized evaluation protocol for long-horizon, cross-dimensional reasoning in digital health. Using an extensible benchmark construction pipeline and a unified evaluation protocol, the study systematically assesses 11 prominent LLMs, uncovering critical limitations in their long-term aggregation and cross-dimensional inference abilities. LifeAgent substantially outperforms existing baselines, demonstrating practical utility in real-world health scenarios.
Abstract
Personalized digital health support requires long-horizon, cross-dimensional reasoning over heterogeneous lifestyle signals, and recent advances in mobile sensing and large language models (LLMs) make such support increasingly feasible. However, the capabilities of current LLMs in this setting remain unclear due to the lack of systematic benchmarks. In this paper, we introduce LifeAgentBench, a large-scale QA benchmark for long-horizon, cross-dimensional, and multi-user lifestyle health reasoning, containing 22,573 questions that span basic retrieval to complex reasoning. We release an extensible benchmark construction pipeline and a standardized evaluation protocol to enable reliable and scalable assessment of LLM-based health assistants. We then systematically evaluate 11 leading LLMs on LifeAgentBench and identify key bottlenecks in long-horizon aggregation and cross-dimensional reasoning. Motivated by these findings, we propose LifeAgent, a strong baseline health-assistant agent that integrates multi-step evidence retrieval with deterministic aggregation, achieving significant improvements over two widely used baselines. Case studies further demonstrate its potential in realistic daily-life scenarios. The benchmark is publicly available at https://anonymous.4open.science/r/LifeAgentBench-CE7B.
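To make the retrieve-then-aggregate design concrete, the sketch below illustrates the general pattern in minimal Python: the agent first narrows the raw lifestyle log to the evidence a question needs (user, dimension, date window), then computes the numeric reduction deterministically in code rather than asking the LLM to do arithmetic over a long horizon. All names here (`HealthRecord`, `retrieve`, `aggregate`) are hypothetical illustrations, not the actual LifeAgent implementation.

```python
# Minimal sketch of a retrieve-then-aggregate agent step.
# Assumption: records are structured per-user, per-day, per-dimension entries;
# the real LifeAgent pipeline may differ.
from dataclasses import dataclass
from datetime import date
from statistics import mean

@dataclass
class HealthRecord:
    user_id: str
    day: date
    dimension: str   # e.g. "sleep_hours", "steps", "calories"
    value: float

def retrieve(records, user_id, dimension, start, end):
    """Step 1: evidence retrieval -- filter the log to the user,
    dimension, and date window the question asks about."""
    return [r for r in records
            if r.user_id == user_id and r.dimension == dimension
            and start <= r.day <= end]

def aggregate(evidence, op):
    """Step 2: deterministic aggregation -- the reduction runs in code,
    so long-horizon sums and averages cannot drift the way free-form
    LLM arithmetic can."""
    values = [r.value for r in evidence]
    return {"mean": mean, "sum": sum, "max": max, "min": min}[op](values)

# Toy usage: "What was u1's average nightly sleep over the first week of May?"
log = [HealthRecord("u1", date(2024, 5, d), "sleep_hours", 6.0 + d % 3)
       for d in range(1, 8)]
evidence = retrieve(log, "u1", "sleep_hours", date(2024, 5, 1), date(2024, 5, 7))
print(f"avg sleep: {aggregate(evidence, 'mean'):.2f} h")  # the LLM verbalizes this
```

In this division of labor the LLM is responsible only for deciding which evidence to fetch and how to phrase the final answer; every number in the response comes from the deterministic aggregation step.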