Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective

📅 2026-03-15
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current evaluation practices for personalized dialogue systems rely excessively on surface-level similarity metrics such as BLEU and ROUGE, which fail to capture deeper cognitive and linguistic qualities like discourse coherence, persona consistency, and shared understanding. This work proposes a novel evaluation paradigm grounded in principles from cognitive science and linguistics, using the LAPDOG framework as a case study. The approach integrates human judgments with large language model–based automated assessments to systematically examine the alignment between conventional metrics and cognitively informed dimensions of dialogue quality. Results reveal high agreement between human evaluators and large language models on these deeper aspects of dialogue quality, yet both diverge substantially from traditional automatic metrics, demonstrating both the necessity and the feasibility of cognition-driven evaluation frameworks for personalized dialogue systems.
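
The summary's pairing of human judgments with LLM-based assessment follows the now-common LLM-as-judge pattern. The sketch below shows how such a judge could be scripted; the rubric dimensions are taken from the summary, but the prompt wording, 1–5 scale, and judge model name are illustrative assumptions, not the paper's actual protocol.

```python
# Hedged LLM-as-judge sketch. The rubric dimensions come from the summary;
# the prompt wording, scale, and model are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = ("discourse coherence", "persona consistency", "shared understanding")

def judge_response(persona: str, history: str, response: str) -> dict:
    """Ask the judge model to rate one response on each rubric dimension (1-5)."""
    prompt = (
        "Rate the RESPONSE from 1 (poor) to 5 (excellent) on each dimension: "
        + ", ".join(RUBRIC)
        + ". Answer with a JSON object mapping each dimension to an integer.\n\n"
        + f"PERSONA:\n{persona}\n\nDIALOGUE HISTORY:\n{history}\n\nRESPONSE:\n{response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request parseable JSON
        temperature=0.0,                          # keep judging repeatable
    )
    return json.loads(completion.choices[0].message.content)
```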

πŸ“ Abstract
In cognitive science and linguistic theory, dialogue is not seen as a chain of independent utterances but as a joint activity sustained by coherence, consistency, and shared understanding. However, many open-domain and personalized dialogue systems report surface-level similarity metrics (e.g., BLEU, ROUGE, F1) among their main measures, even though these metrics fail to capture the deeper aspects of conversational quality. We re-examine a notable retrieval-augmented framework for personalized dialogue, LAPDOG, as a case study in evaluation methodology. Using both human and LLM-based judges, we identify limitations in current evaluation practices, including corrupted dialogue histories, contradictions between retrieved stories and personas, and incoherent response generation. Our results show that human and LLM judgments align closely with each other but diverge from lexical similarity metrics, underscoring the need for cognitively grounded evaluation methods. More broadly, this work charts a path toward more reliable assessment frameworks for retrieval-augmented dialogue systems that better reflect the principles of natural human communication.
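
The divergence the abstract reports between judge scores and lexical metrics is typically quantified with a rank correlation over the same set of responses. Below is a minimal sketch using sacrebleu and scipy with hypothetical data; the paper's actual metrics, judges, and datasets are not reproduced here.

```python
# Hedged sketch: quantifying how well a lexical metric tracks judge ratings.
# All responses and scores below are hypothetical illustration data.
import sacrebleu
from scipy.stats import spearmanr

references = [
    "i love hiking with my golden retriever",
    "my favorite food is sushi",
    "i work as a nurse at the local hospital",
    "i am training for a marathon next spring",
]
hypotheses = [
    "i enjoy hiking with my dog",
    "pizza is the best food ever",
    "i am a nurse at a hospital nearby",
    "marathons are races that people run",
]

# Sentence-level BLEU for each (hypothesis, reference) pair.
bleu_scores = [
    sacrebleu.sentence_bleu(hyp, [ref]).score
    for hyp, ref in zip(hypotheses, references)
]

# Hypothetical 1-5 coherence ratings from a human or LLM judge.
judge_scores = [5, 2, 5, 1]

# A low or negative rho indicates the lexical metric and the judge disagree
# about response quality, mirroring the divergence the abstract describes.
rho, p_value = spearmanr(bleu_scores, judge_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```
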
Problem

Research questions and friction points this paper is trying to address:

- retrieval-augmented dialogue
- evaluation metrics
- conversational coherence
- personalized dialogue
- cognitive grounding

Innovation

Methods, ideas, or system contributions that make the work stand out:

- retrieval-augmented dialogue
- cognitive evaluation
- personalized dialogue
- coherence assessment
- LLM-based judging
🔎 Similar Papers
No similar papers found.
Tianyi Zhang
Institute for Creative Technologies, University of Southern California
David Traum
USC ICT
Dialogue, Natural language processing, Socially Interactive Agents, virtual humans, intelligent user interfaces