REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the core challenge of modeling emotional intelligence (EI) and persona consistency in long-term, open-domain conversational agents. Methodologically, it introduces REALTALK, a dialogue dataset derived from 21 days of real-world messaging-app logs rather than synthetic, LLM-generated conversations, and establishes two benchmark tasks: persona simulation and memory probing. The dataset includes fine-grained emotion annotations, a systematic persona-consistency analysis, and comparative LLM evaluations. Key contributions are: (1) empirical evidence that current models fail to reliably infer a user's persona from dialogue history alone; (2) validation that user-specific fine-tuning substantially improves persona fidelity; and (3) the finding that state-of-the-art LLMs exhibit markedly degraded performance on long-horizon memory QA in real-world settings versus synthetic benchmarks, revealing a fundamental limitation in contextual memory grounding under authentic usage conditions.

📝 Abstract
Long-term, open-domain dialogue capabilities are essential for chatbots aiming to recall past interactions and demonstrate emotional intelligence (EI). Yet, most existing research relies on synthetic, LLM-generated data, leaving open questions about real-world conversational patterns. To address this gap, we introduce REALTALK, a 21-day corpus of authentic messaging app dialogues, providing a direct benchmark against genuine human interactions. We first conduct a dataset analysis, focusing on EI attributes and persona consistency to understand the unique challenges posed by real-world dialogues. By comparing with LLM-generated conversations, we highlight key differences, including diverse emotional expressions and variations in persona stability that synthetic dialogues often fail to capture. Building on these insights, we introduce two benchmark tasks: (1) persona simulation where a model continues a conversation on behalf of a specific user given prior dialogue context; and (2) memory probing where a model answers targeted questions requiring long-term memory of past interactions. Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation. Additionally, existing models face significant challenges in recalling and leveraging long-term context within real-world conversations.
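
To make the persona-simulation task concrete, here is a minimal sketch of how such an evaluation prompt might be assembled. This is not the authors' code: the turn format, the `build_persona_prompt` helper, and the `llm_generate` stand-in are all hypothetical illustrations of the task as the abstract describes it.

```python
from typing import Dict, List

def build_persona_prompt(history: List[Dict[str, str]], target_user: str) -> str:
    """Format prior chat turns and ask the model to reply as `target_user`."""
    transcript = "\n".join(f"{turn['speaker']}: {turn['text']}" for turn in history)
    return (
        "Below is a chat log between two friends.\n\n"
        f"{transcript}\n\n"
        f"Write the next message from {target_user}, matching their usual "
        "style, tone, and persona as shown in the log."
    )

# Example usage with a toy two-turn history.
history = [
    {"speaker": "Alice", "text": "how was the concert last night??"},
    {"speaker": "Bob", "text": "unreal. my ears are still ringing lol"},
]
prompt = build_persona_prompt(history, target_user="Alice")
# response = llm_generate(prompt)  # llm_generate: stand-in for any chat-LLM API
```

The paper's finding that fine-tuning on a specific user's chats improves persona emulation suggests that prompting alone, as sketched here, leaves a measurable gap in persona fidelity.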
Problem

Research questions and friction points this paper is trying to address.

Long-term conversation capabilities
Emotional intelligence in chatbots
Real-world dialogue patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world dialogue dataset
Persona simulation task
Long-term memory probing (see the sketch below)
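
As referenced above, the following is a minimal sketch of what a memory-probing query could look like, assuming the probe concatenates multi-session history and asks a question whose answer appears only in an earlier session. The `build_memory_probe` helper and the session format are illustrative assumptions, not taken from the paper.

```python
from typing import List

def build_memory_probe(sessions: List[str], question: str) -> str:
    """Join multi-session history and append a question whose answer
    lives in an earlier session, testing long-term recall."""
    joined = "\n\n--- next session ---\n\n".join(sessions)
    return (
        f"You are given a multi-day chat history between two users:\n\n{joined}\n\n"
        f"Question: {question}\n"
        "Answer concisely, using only information from the chat history."
    )

# Example usage: the answer appears on day 1, far from the question.
sessions = [
    "Day 1\nAlice: adopting a puppy this weekend, naming him Mochi!",
    "Day 14\nBob: work has been brutal, I barely sleep these days",
]
probe = build_memory_probe(sessions, "What did Alice name her puppy?")
# answer = llm_generate(probe)  # expected answer: "Mochi"
```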
👥 Authors
Dong-Ho Lee, University of Southern California
Adyasha Maharana, Databricks Mosaic Research
Jay Pujara, University of Southern California
Xiang Ren, University of Southern California
Francesco Barbieri, Meta