🤖 AI Summary
This work addresses the insufficient evaluation of large language models' (LLMs) personalized reasoning and generation capabilities in multi-turn dialogues. We introduce PersonaConvBench, the first large-scale benchmark to integrate explicit persona modeling with multi-turn dialogue structure. It spans ten Reddit domains and supports three core tasks: sentence classification, impact regression, and user-centric text generation. Our contributions are threefold: (1) the first systematic integration of explicit persona representations with dynamic dialogue context; (2) a unified, cross-domain, multi-task, user-centered evaluation framework; and (3) a fine-grained assessment protocol with prompt alignment for both commercial and open-source LLMs. Experiments demonstrate that incorporating personalized dialogue history substantially improves model performance; for example, sentiment classification accuracy increases by 198% relative to the best non-conversational baseline. The dataset, code, and full experimental results are publicly released.
📝 Abstract
We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentence classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198% relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench together with its evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.