REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the core challenge of modeling emotional intelligence (EI) and persona consistency in long-term, open-domain conversational agents. Methodologically, it introduces REALTALK, a dialogue dataset derived from 21 days of real-world messaging-app logs rather than synthetic, LLM-generated conversations, and establishes two benchmark tasks: persona simulation and memory probing. The dataset includes fine-grained emotion annotations, a systematic persona-consistency analysis, and comparative LLM evaluations. Key contributions are: (1) empirical evidence that current models fail to reliably infer a user's persona from dialogue history alone; (2) validation that user-specific fine-tuning substantially improves persona fidelity; and (3) the finding that state-of-the-art LLMs exhibit markedly degraded performance on long-horizon memory QA in real-world settings versus synthetic benchmarks, revealing a fundamental limitation in contextual memory grounding under authentic usage conditions.

📝 Abstract
Long-term, open-domain dialogue capabilities are essential for chatbots aiming to recall past interactions and demonstrate emotional intelligence (EI). Yet, most existing research relies on synthetic, LLM-generated data, leaving open questions about real-world conversational patterns. To address this gap, we introduce REALTALK, a 21-day corpus of authentic messaging app dialogues, providing a direct benchmark against genuine human interactions. We first conduct a dataset analysis, focusing on EI attributes and persona consistency to understand the unique challenges posed by real-world dialogues. By comparing with LLM-generated conversations, we highlight key differences, including diverse emotional expressions and variations in persona stability that synthetic dialogues often fail to capture. Building on these insights, we introduce two benchmark tasks: (1) persona simulation where a model continues a conversation on behalf of a specific user given prior dialogue context; and (2) memory probing where a model answers targeted questions requiring long-term memory of past interactions. Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation. Additionally, existing models face significant challenges in recalling and leveraging long-term context within real-world conversations.
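
To make the persona-simulation task concrete, here is a minimal sketch of how such an evaluation prompt might be assembled. This is not the authors' code: the turn format, the `build_persona_prompt` helper, and the `llm_generate` stand-in are all hypothetical illustrations of the task as the abstract describes it.

```python
from typing import Dict, List

def build_persona_prompt(history: List[Dict[str, str]], target_user: str) -> str:
    """Format prior chat turns and ask the model to reply as `target_user`."""
    transcript = "\n".join(f"{turn['speaker']}: {turn['text']}" for turn in history)
    return (
        "Below is a chat log between two friends.\n\n"
        f"{transcript}\n\n"
        f"Write the next message from {target_user}, matching their usual "
        "style, tone, and persona as shown in the log."
    )

# Example usage with a toy two-turn history.
history = [
    {"speaker": "Alice", "text": "how was the concert last night??"},
    {"speaker": "Bob", "text": "unreal. my ears are still ringing lol"},
]
prompt = build_persona_prompt(history, target_user="Alice")
# response = llm_generate(prompt)  # llm_generate: stand-in for any chat-LLM API
```

The paper's finding that fine-tuning on a specific user's chats improves persona emulation suggests that prompting alone, as sketched here, leaves a measurable gap in persona fidelity.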
Problem

Research questions and friction points this paper is trying to address.

Long-term conversation capabilities
Emotional intelligence in chatbots
Real-world dialogue patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world dialogue dataset
Persona simulation task
Long-term memory probing (see the sketch below)
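
As referenced above, the following is a minimal sketch of what a memory-probing query could look like, assuming the probe concatenates multi-session history and asks a question whose answer appears only in an earlier session. The `build_memory_probe` helper and the session format are illustrative assumptions, not taken from the paper.

```python
from typing import List

def build_memory_probe(sessions: List[str], question: str) -> str:
    """Join multi-session history and append a question whose answer
    lives in an earlier session, testing long-term recall."""
    joined = "\n\n--- next session ---\n\n".join(sessions)
    return (
        f"You are given a multi-day chat history between two users:\n\n{joined}\n\n"
        f"Question: {question}\n"
        "Answer concisely, using only information from the chat history."
    )

# Example usage: the answer appears on day 1, far from the question.
sessions = [
    "Day 1\nAlice: adopting a puppy this weekend, naming him Mochi!",
    "Day 14\nBob: work has been brutal, I barely sleep these days",
]
probe = build_memory_probe(sessions, "What did Alice name her puppy?")
# answer = llm_generate(probe)  # expected answer: "Mochi"
```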
👥 Authors
Dong-Ho Lee, University of Southern California
Adyasha Maharana, Databricks Mosaic Research
Jay Pujara, University of Southern California
Xiang Ren, University of Southern California
Francesco Barbieri, Meta