🤖 AI Summary
This work addresses incomplete tool-call arguments that arise when users omit critical details in their requests to large language model (LLM) agents. To study cross-session personalized tool calling, the authors introduce MPT, a benchmark of 265 multi-session dialogues covering Preference Recall, Preference Induction, and Preference Transfer, and propose PRefine, a test-time memory-augmented method that models user preferences as evolving hypotheses. PRefine uses a generate–verify–refine loop to extract reusable constraints from dialogue history, storing the rationale behind user choices rather than only the choices themselves. It improves tool-calling accuracy while using only 1.24% of the tokens required by full-history prompting, which suggests that reasoning-based preference memory is important for robust personalization in agentic systems.
📝 Abstract
Users often omit essential details in their requests to LLM-based agents, resulting in under-specified inputs for tool use. This poses a fundamental challenge for tool-augmented agents, as API execution typically requires complete arguments, highlighting the need for personalized tool calling. To study this problem, we introduce MPT, a benchmark comprising 265 multi-session dialogues that cover three challenges: Preference Recall, Preference Induction, and Preference Transfer. We also propose PRefine, a test-time memory-augmented method that represents user preferences as evolving hypotheses. Through a generate–verify–refine loop, it extracts reusable constraints from history and improves tool-calling accuracy while using only 1.24% of the tokens required by full-history prompting. These results indicate that robust personalization in agentic systems depends on memory that captures the reasons behind user choices, not just the choices themselves.
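To make the idea of preferences as evolving hypotheses concrete, the following is a minimal toy sketch of a generate, verify, refine loop. It is not the authors' PRefine implementation: the abstract does not specify the algorithm, and a real system would presumably use an LLM at each step. Every name here (`Observation`, `Hypothesis`, `generate`, `verify`, `refine`, `induce`, `fill_argument`) and the flight-booking data are hypothetical, and the rule-based steps are stand-ins chosen only so the example runs on its own.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Observation:
    """One past tool call: the request context and the argument the user chose."""
    context: dict
    slot: str
    value: str


@dataclass(frozen=True)
class Hypothesis:
    """Candidate preference: when `condition` holds, fill `slot` with `value`."""
    slot: str
    value: str
    condition: tuple = ()  # (key, value) pairs; empty means "always applies"

    def applies(self, context):
        return all(context.get(k) == v for k, v in self.condition)


def generate(obs, slot):
    """Generate: propose the most frequent past value as an unconditional rule."""
    value, _ = Counter(o.value for o in obs).most_common(1)[0]
    return Hypothesis(slot, value)


def verify(hypothesis, obs):
    """Verify: return past observations that the hypothesis covers but gets wrong."""
    return [o for o in obs
            if hypothesis.applies(o.context) and o.value != hypothesis.value]


def refine(obs, slot):
    """Refine: look for one context key whose value fully explains the choices."""
    for key in sorted({k for o in obs for k in o.context}):
        groups = {}
        for o in obs:
            groups.setdefault(o.context.get(key), set()).add(o.value)
        if all(len(values) == 1 for values in groups.values()):
            return [Hypothesis(slot, next(iter(values)), ((key, k),))
                    for k, values in groups.items()]
    return [generate(obs, slot)]  # nothing explains it; keep the majority guess


def induce(history, slot, max_rounds=3):
    """Run the loop until no hypothesis contradicts the history, or rounds run out."""
    obs = [o for o in history if o.slot == slot]
    hypotheses = [generate(obs, slot)]
    for _ in range(max_rounds):
        if not any(verify(h, obs) for h in hypotheses):
            break
        hypotheses = refine(obs, slot)
    return hypotheses


def fill_argument(hypotheses, context):
    """Fill a missing tool argument from the first applicable hypothesis, if any."""
    return next((h.value for h in hypotheses if h.applies(context)), None)


# Hypothetical history: the user flies business for work and economy for leisure.
history = [
    Observation({"trip": "work", "airline": "A"}, "seat_class", "business"),
    Observation({"trip": "leisure", "airline": "A"}, "seat_class", "economy"),
    Observation({"trip": "work", "airline": "B"}, "seat_class", "business"),
    Observation({"trip": "leisure", "airline": "B"}, "seat_class", "economy"),
]
prefs = induce(history, "seat_class")
print(fill_argument(prefs, {"trip": "leisure", "airline": "C"}))
```

In this toy run, the first unconditional guess is contradicted by the leisure trips, so the loop replaces it with two rules conditioned on `trip`. The stored memory is then a pair of compact rules that record why a value was chosen, which still apply to an airline never seen in the history, whereas a memory of raw past choices would have nothing to say about it.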