Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues

📅 2025-09-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study identifies progressive quality degradation in large language models (LLMs) across multi-turn, knowledge-intensive role-play dialogues, such as professional training simulations. To address the lack of benchmarks capturing multi-turn, knowledge-dependent interactions, the authors introduce the first dedicated multi-turn degradation benchmark for this setting. They further propose a hybrid evaluation framework that integrates human assessment (N=38) with LLM-based adjudication (Gemini 2.0 Flash), supporting zero-shot pairwise preference ranking and stochastic six-shot construct scoring. Experimental results reveal significant declines in the naturalness and contextual consistency of LLM-generated responses over turns, whereas human-authored responses consistently improve, and participants strongly prefer the human-authored dialogues. Automated evaluations align closely with human judgments (Spearman's ρ > 0.9), robustly confirming the degradation trend. The work establishes a verifiable evaluation paradigm and an empirical foundation for reliably integrating LLMs into high-fidelity training simulations.
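As an illustration of the zero-shot pairwise preference stage of the hybrid framework, the sketch below asks a judge model to choose the better of two candidate responses for a given dialogue context. The prompt wording, the helper function, and the use of the google-generativeai client are assumptions for illustration, not the authors' implementation.

```python
# Illustrative zero-shot pairwise preference judging (not the paper's exact prompt or code).
# Assumes the google-generativeai client and an API key in the environment.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
judge = genai.GenerativeModel("gemini-2.0-flash")

def pairwise_preference(context: str, response_a: str, response_b: str) -> str:
    """Ask the judge model which response better continues the role-play dialogue."""
    prompt = (
        "You are evaluating two candidate responses in a professional training "
        "role-play dialogue.\n\n"
        f"Dialogue context:\n{context}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Considering naturalness, contextual consistency, and overall quality, "
        "answer with exactly one letter: A or B."
    )
    reply = judge.generate_content(prompt)
    verdict = reply.text.strip().upper()
    # Fall back to B unless the judge clearly prefers A.
    return "A" if verdict.startswith("A") else "B"
```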

📝 Abstract
Evaluating large language models (LLMs) in long-form, knowledge-grounded role-play dialogues remains challenging. This study compares LLM-generated and human-authored responses in multi-turn professional training simulations through human evaluation (N=38) and automated LLM-as-a-judge assessment. Human evaluation revealed significant degradation in LLM-generated response quality across turns, particularly in naturalness, context maintenance and overall quality, while human-authored responses progressively improved. In line with this finding, participants also indicated a consistent preference for human-authored dialogue. These human judgements were validated by our automated LLM-as-a-judge evaluation, where Gemini 2.0 Flash achieved strong alignment with human evaluators on both zero-shot pairwise preference and stochastic 6-shot construct ratings, confirming the widening quality gap between LLM and human responses over time. Our work contributes a multi-turn benchmark exposing LLM degradation in knowledge-grounded role-play dialogues and provides a validated hybrid evaluation framework to guide the reliable integration of LLMs in training simulations.
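The reported alignment between automated and human judgments can be checked with a rank correlation. The sketch below computes Spearman's ρ over per-turn mean ratings; the score lists are hypothetical placeholders, not data from the paper.

```python
# Illustrative sketch: checking LLM-judge vs. human alignment with Spearman's rho.
# The per-turn score lists below are hypothetical placeholders, not the paper's data.
from scipy.stats import spearmanr

# Mean quality ratings per dialogue turn (turns 1..6), one list per rater type.
human_scores = [4.3, 4.1, 3.8, 3.5, 3.2, 3.0]   # human evaluators, LLM-generated responses
judge_scores = [4.4, 4.0, 3.7, 3.6, 3.1, 2.9]   # Gemini 2.0 Flash as judge, same responses

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3g})")
# A rho above 0.9, as reported in the paper, indicates strong rank-order agreement
# between the automated judge and human assessments of the degradation trend.
```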
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM performance degradation in multi-turn role-play dialogues
Comparing human-authored versus LLM-generated responses in training simulations
Developing a hybrid evaluation framework for knowledge-grounded dialogue assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human evaluation and automated LLM-as-a-judge assessment
Multi-turn benchmark exposing LLM degradation in dialogues
Validated hybrid evaluation framework for training simulations