🤖 AI Summary
This work addresses the limited cross-dataset generalizability of time-series behavioral prediction models: conventional ML approaches tend to overfit cohort-specific artifacts, while large language models (LLMs) struggle to reason reliably over long, heterogeneous time series. The authors propose TimeSRL, a two-stage LLM framework built around a semantic bottleneck: the model first abstracts raw time-series data into natural language descriptions, then predicts behavioral outcomes from those descriptions alone. The process is optimized end-to-end with Group Relative Policy Optimization (GRPO) under Reinforcement Learning from Verifiable Rewards (RLVR), so no gold intermediate annotations are required. Evaluated on anxiety and depression prediction, TimeSRL reduces mean absolute error by 3.1–10.1% relative to non-LLM ML baselines and by 9.5–57.6% relative to LLM baselines (all p<0.05), and its cross-benchmark transfer performance rivals its within-domain results without target-domain fine-tuning.
📝 Abstract
Longitudinal passive sensing enables continuous health prediction, yet models often fail under cross-dataset distribution shifts. Traditional ML overfits cohort-specific artifacts, while Large Language Models (LLMs) struggle to reason reliably over long, heterogeneous time series. We introduce TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck. The model first abstracts raw signals into high-level natural language, then predicts behavioral outcomes from these abstractions alone. This forces the model to reason over semantic concepts that we argue generalize better than raw numbers. We optimize this process end-to-end using Group Relative Policy Optimization (GRPO) with Reinforcement Learning from Verifiable Rewards (RLVR), learning outcome-aligned abstractions without gold intermediate annotations. Instantiated on mental-health prediction, TimeSRL achieves state-of-the-art performance on a benchmark designed to stress-test cross-cohort generalization under a rigorous leave-one-dataset-out (LOSO) protocol, reducing mean absolute error (MAE) relative to strong non-LLM ML and LLM baselines by 3.1--10.1% and 9.5--44.1%, respectively, for anxiety, and by 3.2--9.6% and 27.4--57.6%, respectively, for depression (all $p<0.05$). TimeSRL significantly outperforms prior methods in cross-benchmark transfer across different sensing pipelines, rivaling its own within-domain performance without target-domain fine-tuning. These results demonstrate that semantic abstractions are reusable and point to a new direction for generalizable behavior modeling via RL-tuned LLMs.
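To make the optimization signal more concrete, here is a minimal sketch of the group-relative advantage computation commonly associated with GRPO, paired with a verifiable reward. The abstract does not specify the reward, so the negative absolute error used here is an assumption, and the function names `verifiable_reward` and `group_relative_advantages` are hypothetical. This is an illustration of the general idea, not the authors' implementation.

```python
import statistics


def verifiable_reward(predicted: float, target: float) -> float:
    """Hypothetical reward: negative absolute error against the known outcome.

    Assumption: the abstract only says rewards are verifiable; this exact
    form is chosen here for illustration.
    """
    return -abs(predicted - target)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group sampled for the same input.

    Each rollout's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One input, four sampled rollouts (abstraction followed by prediction).
# The scores below are made-up numbers for illustration only.
target = 10.0
predictions = [8.0, 10.0, 15.0, 11.0]

rewards = [verifiable_reward(p, target) for p in predictions]
advantages = group_relative_advantages(rewards)

for p, r, a in zip(predictions, rewards, advantages):
    print(f"prediction={p:5.1f}  reward={r:5.1f}  advantage={a:+.3f}")
```

In this sketch the rollout whose prediction is closest to the target gets the largest advantage, and no label on the intermediate natural language abstraction is needed: only the final prediction is scored.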