Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale

📅 2025-04-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) struggle to model the dynamic evolution of user personas and to generate consistently personalized responses over extended interactions. To address this gap, we introduce PERSONAMEM, a benchmark of 180 simulated user personas with interaction histories of up to 60 sessions of multi-turn dialogue spanning 15 task categories, together with a scalable simulation pipeline for controllable persona generation and dialogue synthesis. To our knowledge, it is the first benchmark for personalized response selection designed explicitly around temporal user memory: models must track how a user's preferences evolve and select the response that fits the user's current state. Empirical evaluation shows that state-of-the-art models, including GPT-4.1, o4-mini, and Gemini-2.0, achieve only about 50% accuracy on this task, exposing critical limitations in their user awareness and long-horizon interaction modeling. These findings underscore a fundamental challenge in persistent personalization and highlight the need for architectures that robustly encode, update, and leverage evolving user representations over time.

📝 Abstract
Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks -- from offering writing support to delivering tailored recommendations or consultations. Over time, the interaction history between a user and an LLM can provide extensive information about an individual's traits and preferences. However, open questions remain on how well LLMs today can effectively leverage such history to (1) internalize the user's inherent traits and preferences, (2) track how the user's profile and preferences evolve over time, and (3) generate personalized responses accordingly in new scenarios. In this work, we introduce the PERSONAMEM benchmark. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories, each containing up to 60 sessions of multi-turn conversations across 15 real-world tasks that require personalization. Given an in-situ user query, i.e., a query issued by the user from the first-person perspective, we evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile. We observe that current LLMs still struggle to recognize the dynamic evolution in users' profiles over time through direct prompting approaches. As a consequence, LLMs often fail to deliver responses that align with users' current situations and preferences, with frontier models such as GPT-4.1, o4-mini, GPT-4.5, o1, or Gemini-2.0 achieving only around 50% overall accuracy, suggesting room for improvement. We hope that PERSONAMEM, along with the user profile and conversation simulation pipeline, can facilitate future research in the development of truly user-aware chatbots. Code and data are available at github.com/bowen-upenn/PersonaMem.
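The evaluation described above is a response-selection task: given the interaction history and an in-situ query, the model must pick the candidate response that matches the user's current profile state, and accuracy is computed over all such questions. A minimal sketch of that scoring loop is below; the `Question` layout, field names, and `evaluate` function are illustrative assumptions, not the benchmark's actual data format or API.

```python
# Hypothetical sketch of a PERSONAMEM-style response-selection evaluation.
# All names and the data layout here are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Question:
    history: List[str]     # prior user-LLM sessions, oldest first
    query: str             # in-situ, first-person user query
    candidates: List[str]  # candidate chatbot responses
    gold: int              # index of the response matching the user's
                           # *current* profile state


def evaluate(
    choose: Callable[[List[str], str, List[str]], int],
    questions: List[Question],
) -> float:
    """Fraction of questions where the model picks the gold response.

    `choose(history, query, candidates)` returns a candidate index.
    """
    correct = sum(
        choose(q.history, q.query, q.candidates) == q.gold
        for q in questions
    )
    return correct / len(questions)


# Trivial baseline: always pick the first candidate.
def first_pick(history, query, candidates):
    return 0
```

A random or constant baseline on a balanced two-choice set scores around 50%, which makes the frontier models' ~50% overall accuracy reported above easy to contextualize.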
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to internalize users' inherent traits and preferences
Tracking how user profiles and preferences evolve over time
Generating appropriately personalized responses in new scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

PERSONAMEM benchmark for dynamic user profiling
Evaluation over simulated user-LLM interaction histories
Pipeline for user-aware chatbot development