VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing agent benchmarks focus on reasoning and tool use, and struggle to assess how well an agent models user preferences and proactively acquires information across long-term, fragmented interactions. To address this gap, this work introduces VitaBench 2.0, a benchmark for evaluating agents in sustained user collaboration. It covers two evaluation dimensions, personalization modeling and proactive behavior, through temporally ordered, heterogeneous task sequences for individual users. Tasks embed evolving user preferences, and an extensible memory interface enables controlled comparisons across memory architectures. Evaluations of frontier proprietary and open-source large language models show that performance in realistic personalized scenarios remains limited, with bottlenecks in continuously tracking evolving preferences and in proactively filling in missing information. These findings point to directions for future research.

📝 Abstract

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.
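
The abstract mentions an extensible memory interface for comparing memory architectures under controlled conditions, but does not describe its shape. The following is only an illustrative sketch of what such a pluggable interface could look like, not the benchmark's actual API: the names `Interaction`, `Memory`, `FullHistoryMemory`, `KeywordMemory`, and `replay` are all hypothetical, and the keyword matcher is a toy stand-in for a real retrieval method.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Interaction:
    """One fragment of a user's history (hypothetical record type)."""
    timestamp: int  # position in the user's temporal sequence
    text: str       # raw content of the interaction


class Memory(ABC):
    """Hypothetical pluggable memory backend; an assumption, not VitaBench's API."""

    @abstractmethod
    def write(self, interaction: Interaction) -> None:
        """Store one interaction."""

    @abstractmethod
    def read(self, query: str) -> list[str]:
        """Return the stored texts considered relevant to the query."""


class FullHistoryMemory(Memory):
    """Baseline: keep everything and return all of it in write order."""

    def __init__(self) -> None:
        self._log: list[Interaction] = []

    def write(self, interaction: Interaction) -> None:
        self._log.append(interaction)

    def read(self, query: str) -> list[str]:
        return [i.text for i in self._log]


class KeywordMemory(FullHistoryMemory):
    """Toy retrieval: return only texts that share a word with the query."""

    def read(self, query: str) -> list[str]:
        words = set(query.lower().split())
        return [i.text for i in self._log if words & set(i.text.lower().split())]


def replay(memory: Memory, history: list[Interaction], query: str) -> list[str]:
    """Feed interactions in temporal order, then query the memory once."""
    for interaction in sorted(history, key=lambda i: i.timestamp):
        memory.write(interaction)
    return memory.read(query)


history = [
    Interaction(2, "I switched to vegetarian food"),
    Interaction(1, "I love spicy food"),
    Interaction(3, "Book a taxi for 9am"),
]
full = replay(FullHistoryMemory(), history, "order food")
filtered = replay(KeywordMemory(), history, "order food")
```

Because both backends sit behind the same `write`/`read` pair, the same history and query can be replayed against each one, which is the kind of controlled comparison the abstract refers to. Here `full` contains all three interactions, while `filtered` keeps only the two that mention food.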

Problem

Research questions and friction points this paper is trying to address.

personalization

proactiveness

long-term interaction

user preference

agent benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

personalized agents

proactive interaction

long-term user modeling