AI Summary
Existing evaluation benchmarks for personalized assistants struggle to capture the complex external environments and user cognitive states inherent in real-world interactions, and in particular lack assessments of long-term, multi-turn intent understanding and dynamic user modeling. To address this gap, this work proposes LifeSim, a user simulator that, for the first time, integrates the Belief-Desire-Intention (BDI) cognitive model into user behavior simulation, generating coherent life trajectories within a physical environment to drive multi-turn interactions. The authors further introduce LifeSim-Eval, a comprehensive benchmark spanning eight daily-life domains and 1,200 scenarios, designed to evaluate assistants' capabilities in explicit and implicit intent comprehension, user profile reconstruction, and response quality. Experimental results reveal significant deficiencies in current large language models' handling of implicit intent and long-term preference modeling, validating the necessity and effectiveness of the proposed benchmark.
Abstract
The rapid advancement of large language models (LLMs) has accelerated progress toward universal AI assistants. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) model within physical environments to generate coherent life trajectories, and simulates intention-driven user interactive behaviors. Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance. LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, and adopts a multi-turn interactive protocol to assess models' abilities to fulfill explicit and implicit intentions, recover user profiles, and produce high-quality responses. Under both single-scenario and long-horizon settings, our experiments reveal that current LLMs face significant limitations in handling implicit intentions and modeling long-term user preferences.
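To make the BDI-driven simulation idea concrete, the loop below is a minimal, hypothetical sketch of one simulator turn: beliefs about the environment filter the user's desires, the simulator commits to one as an intention, and the intention drives an utterance to the assistant. All names and data structures here are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class BDIState:
    """Illustrative user cognitive state: beliefs, desires, committed intention."""
    beliefs: dict            # user's view of the environment, e.g. {"at_home": True}
    desires: list            # candidate goals with preconditions, ordered by priority
    intention: str = None    # the desire the user is currently committed to

def deliberate(state: BDIState) -> BDIState:
    """Commit to the highest-priority desire whose precondition holds under current beliefs."""
    for desire in state.desires:
        if desire["precondition"](state.beliefs):
            state.intention = desire["goal"]
            break
    return state

def act(state: BDIState) -> str:
    """Turn the committed intention into an intention-driven user utterance."""
    if state.intention is None:
        return "(no request this turn)"
    return f"Can you help me {state.intention}?"

# One simulated turn: morning at home, so the meal-planning desire fires first.
state = BDIState(
    beliefs={"time": "morning", "at_home": True},
    desires=[
        {"goal": "book a gym class", "precondition": lambda b: not b["at_home"]},
        {"goal": "plan today's meals", "precondition": lambda b: b["at_home"]},
    ],
)
print(act(deliberate(state)))  # -> Can you help me plan today's meals?
```

In a long-horizon setting, updating `beliefs` after each environment step and re-running `deliberate` is what would yield a coherent life trajectory rather than independent one-off requests.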