🤖 AI Summary
The lack of open-source dialogue benchmarks tailored to personalized AI assistant research hinders systematic evaluation and development of the personalization capabilities of large language models (LLMs).
Method: We introduce HiCUPID, the first open-source, multi-turn personalized dialogue dataset that supports user profiling and long-term memory awareness, and develop a learnable automatic evaluation model based on a fine-tuned Llama-3.2. We further propose an end-to-end framework for evaluating personalized assistants that integrates user profile injection, memory consistency verification, and alignment with human preferences.
Contribution/Results: HiCUPID substantially improves the efficiency, reproducibility, and interpretability of personalized response evaluation. It establishes a unified, scalable benchmark platform for LLM personalization research, enabling rigorous, standardized assessment of adaptive, context-aware, and user-consistent behavior in conversational agents.
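To illustrate how verdicts from an automated judge of this kind are typically summarized, here is a minimal, hypothetical sketch (not the paper's actual pipeline; the helper name and verdict labels are assumptions) that aggregates pairwise judge decisions into win/tie/loss rates:

```python
from collections import Counter

def aggregate_verdicts(verdicts):
    """Aggregate per-dialogue judge verdicts ('win', 'tie', 'loss')
    into summary rates. Hypothetical helper, not part of HiCUPID."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    counts = Counter(verdicts)
    total = len(verdicts)
    return {
        "win_rate": counts["win"] / total,
        "tie_rate": counts["tie"] / total,
        "loss_rate": counts["loss"] / total,
    }

# Example: verdicts a judge model might emit for five test dialogues
stats = aggregate_verdicts(["win", "win", "tie", "loss", "win"])
print(stats["win_rate"])  # 0.6
```

In practice the judge model (e.g., the fine-tuned Llama-3.2 evaluator) would produce one verdict per dialogue pair, and these rates would be compared against human preference annotations.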
📝 Abstract
Personalized AI assistants, a hallmark of the human-like capabilities of Large Language Models (LLMs), represent a challenging application that intertwines multiple problems in LLM research. Despite growing interest in the development of personalized assistants, the lack of an open-source conversational dataset tailored for personalization remains a significant obstacle for researchers in the field. To address this research gap, we introduce HiCUPID, a new benchmark to probe and unleash the potential of LLMs to deliver personalized responses. Alongside a conversational dataset, HiCUPID provides a Llama-3.2-based automated evaluation model whose assessments closely mirror human preferences. We release our dataset, evaluation model, and code at https://github.com/12kimih/HiCUPID.