HEARTS: Benchmarking LLM Reasoning on Health Time Series

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess the reasoning capabilities of large language models (LLMs) on real-world, multimodal health signals with strong temporal dependencies. To address this gap, this work introduces HEARTS, a unified benchmark integrating 16 real-world datasets spanning 12 health domains and 20 physiological signal modalities, along with 110 structured tasks. HEARTS establishes the first hierarchical evaluation framework specifically designed for temporal reasoning on health data and conducts a systematic assessment of 14 state-of-the-art LLMs across more than 20,000 samples. The study reveals systematic deficiencies in LLMs’ complex temporal reasoning: their performance lags significantly behind specialized models, shows weak correlation with general reasoning ability, and relies heavily on heuristic strategies that degrade markedly as temporal complexity increases. HEARTS provides an extensible, living evaluation platform to advance reliable health AI.

📝 Abstract
The rise of large language models (LLMs) has shifted time series analysis from narrow analytics to general-purpose reasoning. Yet, existing benchmarks cover only a small set of health time series modalities and tasks, failing to reflect the diverse domains and extensive temporal dependencies inherent in real-world physiological modeling. To bridge these gaps, we introduce HEARTS (Health Reasoning over Time Series), a unified benchmark for evaluating hierarchical reasoning capabilities of LLMs over general health time series. HEARTS integrates 16 real-world datasets across 12 health domains and 20 signal modalities, and defines a comprehensive taxonomy of 110 tasks grouped into four core capabilities: Perception, Inference, Generation, and Deduction. Evaluating 14 state-of-the-art LLMs on more than 20K test samples reveals intriguing findings. First, LLMs substantially underperform specialized models, and their performance is only weakly related to general reasoning scores. Moreover, LLMs often rely on simple heuristics and struggle with multi-step temporal reasoning. Finally, performance declines with increasing temporal complexity, with similar failure modes within model families, indicating that scaling alone is insufficient. By making these gaps measurable, HEARTS provides a standardized testbed and living benchmark for developing next-generation LLM agents capable of reasoning over diverse health signals.
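The abstract describes scoring LLMs on 110 tasks grouped into four core capabilities (Perception, Inference, Generation, Deduction). As a minimal illustrative sketch of how such per-capability aggregation could work, the snippet below groups task outcomes by capability and reports accuracy for each. The field names, task IDs, and scoring scheme are hypothetical, not the benchmark's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four core capabilities named in the HEARTS taxonomy.
CAPABILITIES = ("Perception", "Inference", "Generation", "Deduction")

@dataclass
class TaskResult:
    capability: str  # one of CAPABILITIES
    task_id: str     # hypothetical task identifier
    correct: bool    # whether the model's answer was judged correct

def capability_accuracy(results):
    """Aggregate accuracy per capability over a list of TaskResult."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        totals[r.capability] += 1
        hits[r.capability] += int(r.correct)
    return {c: hits[c] / totals[c] for c in totals}

# Toy example: two Perception tasks solved, one Deduction task missed.
results = [
    TaskResult("Perception", "trend-id-001", True),
    TaskResult("Perception", "peak-count-002", True),
    TaskResult("Deduction", "multi-step-003", False),
]
print(capability_accuracy(results))  # → {'Perception': 1.0, 'Deduction': 0.0}
```

Reporting accuracy separately per capability, rather than one pooled score, is what lets a hierarchical benchmark expose the pattern the paper finds: performance degrading as tasks move from perception toward multi-step deduction.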
Problem

Research questions and friction points this paper is trying to address.

health time series
large language models
reasoning benchmark
temporal dependencies
physiological modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

health time series
large language models
reasoning benchmark
temporal reasoning
hierarchical reasoning