AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
The field currently lacks high-quality evaluation benchmarks grounded in real-world dialogues, which hinders effective assessment of how well large language models handle personalized memory and preference alignment. To address this gap, this work introduces the first structured memory evaluation benchmark built upon WildChat, a dataset of authentic long-term human-AI conversations. The benchmark covers four core tasks (memory extraction, updating, retrieval, and utilization) and incorporates both explicit and implicit personalization signals. Using human-validated memory annotations, a multi-task evaluation framework, and a distractor-based testing mechanism, the study systematically evaluates model performance across the entire memory lifecycle. Experimental results reveal significant limitations in state-of-the-art models, particularly in extracting implicit features, updating memories, resisting interference during retrieval, and generating responses aligned with user sentiment.
📝 Abstract
As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook the personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogues. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks (personalized information extraction, updating, retrieval, and utilization) and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.
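
To make the distractor-based retrieval setting in finding (iii) concrete, the following is a minimal sketch and not AlpsBench's actual protocol or data schema. The `MemoryEntry` fields, the `token_overlap` scorer, and the `hit_at_k` function are all hypothetical names introduced for illustration; a real evaluation would substitute the system under test for the toy word-overlap scorer.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    """Hypothetical structured memory record; field names are assumptions."""
    user_id: str
    text: str
    explicit: bool  # True if stated directly by the user, False if inferred


def token_overlap(query: str, text: str) -> float:
    """Toy relevance score: Jaccard overlap of lowercase word sets."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if (q | t) else 0.0


def hit_at_k(query, gold, distractors, k=1, seed=0):
    """Return True if the gold memory ranks in the top k of the mixed pool."""
    pool = [gold, *distractors]
    random.Random(seed).shuffle(pool)  # avoid positional bias toward gold
    ranked = sorted(pool, key=lambda m: token_overlap(query, m.text), reverse=True)
    return gold in ranked[:k]


gold = MemoryEntry("u1", "prefers vegetarian recipes", explicit=True)
easy = [
    MemoryEntry("u2", "works as a nurse", explicit=True),
    MemoryEntry("u3", "owns a dog named Rex", explicit=True),
]
# A distractor that matches the query wording more closely than the gold memory.
hard = easy + [MemoryEntry("u4", "suggest vegetarian recipes for dinner", explicit=False)]
query = "suggest vegetarian recipes for dinner"

easy_hit = hit_at_k(query, gold, easy, k=1)
hard_hit = hit_at_k(query, gold, hard, k=1)
```

In this toy setup the gold memory is retrieved at rank 1 against the unrelated distractors but is displaced once a lexically closer distractor joins the pool, which is the kind of degradation a growing distractor pool is meant to expose.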
Problem

Research questions and friction points this paper is trying to address.

LLM personalization
real-dialogue memorization
preference alignment
evaluation benchmark
memory management
Innovation

Methods, ideas, or system contributions that make the work stand out.

personalization benchmark
real-dialogue memorization
preference alignment
memory management
large language models