🤖 AI Summary
This work addresses the “proactivity gap” in long-lived large language model (LLM) agents: they rarely ask for user preferences that go unstated but will be needed in later interactions, which limits what they can do across sessions. The study isolates a controllable slice of this gap as the “Ask-to-Remember” (ATR) task and presents ATRBench, which the authors describe as, to the best of their knowledge, the first ATR benchmark. ATRBench isolates the agent’s active inquiry capability by keeping each user’s ground-truth preferences hidden, employing multi-session task designs, and using controlled simulation environments, shifting the evaluation focus from memory retention to strategic questioning. Across eight frontier LLM agents, default strategies score at least 62 points below an oracle that is given the relevant preference, and prompt engineering yields only marginal gains; diagnostics point to preference acquisition as the bottleneck.
📝 Abstract
A long-lived LLM agent, such as OpenClaw, earns its value by acting on a user's preferences and constraints across sessions, not just on the current request. Yet today's agents keep what a user volunteers but rarely ask for what stays unspoken. This leaves a proactivity gap: an agent cannot act on a preference it never obtained. As users delegate more of their affairs to agents, the impact of this gap grows. We isolate one concrete, controllable slice of this gap as Ask-to-Remember (ATR): the agent decides whether to ask now for a reusable user preference that the current task does not need but a later session with the same user will. ATR is hard even to evaluate: the right question is underdetermined, and its payoff is deferred to tasks that may never arise. ATRBench, to the best of our knowledge the first ATR benchmark, makes ATR measurable by fixing each user's preferences as hidden ground truth, so success demands asking, not recall. Across eight frontier LLM agents, the agents' defaults score at least 62 points below an oracle that is handed the relevant preference, and prompting closes little of that gap. Diagnostics identify acquisition as the bottleneck. ATRBench surfaces this proactivity gap in current agents and offers a diagnostic testbed for closing it.
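The ATR setup described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not ATRBench's actual protocol or scoring: the names `SimulatedUser`, `Agent`, and `score`, the two example preferences, and the 0-to-100 scale are all hypothetical. It shows why hidden ground truth makes success depend on asking: an agent that never asks in the first session has nothing to recall in the later one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SimulatedUser:
    # Hidden ground truth (hypothetical): the agent learns it only by asking.
    hidden_preferences: Dict[str, str]

    def answer(self, key: str) -> Optional[str]:
        return self.hidden_preferences.get(key)


@dataclass
class Agent:
    ask_proactively: bool
    memory: Dict[str, str] = field(default_factory=dict)

    def first_session(self, user: SimulatedUser, future_key: str) -> None:
        # The current task does not need `future_key`, so asking is optional.
        if self.ask_proactively:
            reply = user.answer(future_key)
            if reply is not None:
                self.memory[future_key] = reply

    def later_session(self, key: str) -> Optional[str]:
        # In the later session the agent can use only what it stored earlier.
        return self.memory.get(key)


def score(agent: Agent, user: SimulatedUser, keys: List[str]) -> float:
    """Toy score: percentage of later-session preferences the agent holds."""
    for key in keys:
        agent.first_session(user, key)
    hits = sum(agent.later_session(k) == user.hidden_preferences[k] for k in keys)
    return 100.0 * hits / len(keys)


user = SimulatedUser({"seat": "aisle", "diet": "vegetarian"})
keys = list(user.hidden_preferences)
passive_score = score(Agent(ask_proactively=False), user, keys)
asking_score = score(Agent(ask_proactively=True), user, keys)
print(f"passive={passive_score}, asking={asking_score}")
```

The gap between the two toy scores is an artifact of this simplified setup and is unrelated to the 62-point figure reported in the abstract. A real agent must also decide which question is worth asking, which this sketch sidesteps by handing it the key.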