🤖 AI Summary
Current evaluation practices for personalized dialogue systems rely excessively on surface-level similarity metrics such as BLEU and ROUGE, which fail to capture deeper cognitive and linguistic qualities like discourse coherence, persona consistency, and shared understanding. This work proposes a novel evaluation paradigm grounded in principles from cognitive science and linguistics, using the LAPDOG framework as a case study. The approach integrates human judgments with large language model-based automated assessments to systematically examine the alignment between conventional metrics and cognitively informed dimensions of dialogue quality. Results reveal high agreement between human evaluators and large language models on these deeper aspects of dialogue quality, yet both significantly diverge from traditional automatic metrics, thereby demonstrating the necessity and feasibility of cognition-driven evaluation frameworks for personalized dialogue systems.
📄 Abstract
In cognitive science and linguistic theory, dialogue is seen not as a chain of independent utterances but as a joint activity sustained by coherence, consistency, and shared understanding. However, many systems for open-domain and personalized dialogue report surface-level similarity metrics (e.g., BLEU, ROUGE, F1) as primary measures, which fail to capture these deeper aspects of conversational quality. We re-examine a notable retrieval-augmented framework for personalized dialogue, LAPDOG, as a case study in evaluation methodology. Using both human and LLM-based judges, we identify limitations in current evaluation practices, including corrupted dialogue histories, contradictions between retrieved stories and persona, and incoherent response generation. Our results show that human and LLM judgments align closely with each other but diverge from lexical similarity metrics, underscoring the need for cognitively grounded evaluation methods. Broadly, this work charts a path toward more reliable assessment frameworks for retrieval-augmented dialogue systems that better reflect the principles of natural human communication.