🤖 AI Summary
Existing evaluations of role-playing language models rely heavily on manual annotation and static, single-dimensional metrics, and they scale poorly. To address these limitations, this paper proposes the first LLM-driven, dynamic, closed-loop evaluation framework. Methodologically, it introduces a tripartite architecture of "Player," "Interrogator," and "Judge" models: the interrogator autonomously simulates diverse user dialogue behaviors, and multiple judges collaboratively score the quality of the resulting multi-turn interactions. Crucially, this establishes an end-to-end evaluation loop that requires no human annotation. Experiments show high correlation (Spearman ρ > 0.85) between automated scores and human judgments across key dimensions, including coherence, role consistency, and response appropriateness, while substantially improving scalability, cross-model consistency, and compatibility with the broader model ecosystem.
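The closed loop described above can be sketched in a few lines of code. The following is a minimal illustration under stated assumptions: the `LLM` callable signature, the prompts, the character name, and the 1-10 scoring rubric are hypothetical placeholders, not the paper's actual API.

```python
from statistics import mean
from typing import Callable

# Assumed interface: any function mapping (system_prompt, messages) -> reply text.
LLM = Callable[[str, list[dict]], str]

def run_episode(player: LLM, interrogator: LLM, judges: list[LLM],
                character: str, num_turns: int = 5) -> tuple[list[dict], float]:
    """Interrogator and player alternate turns; judges then score the dialogue."""
    transcript: list[dict] = []
    for _ in range(num_turns):
        # The interrogator simulates a user utterance given the dialogue so far.
        user_msg = interrogator("Act as a curious user probing the character.",
                                transcript)
        transcript.append({"role": "user", "content": user_msg})
        # The player answers strictly in character.
        reply = player(f"You are role-playing as {character}.", transcript)
        transcript.append({"role": "assistant", "content": reply})
    # Each judge returns a numeric score; the episode score is their mean,
    # which is where the multi-judge collaboration enters the loop.
    rubric = ("Rate the assistant's role consistency from 1 to 10. "
              "Reply with a single number.")
    scores = [float(judge(rubric, transcript)) for judge in judges]
    return transcript, mean(scores)

# Example with trivial stubs (a real run would plug in actual model calls):
if __name__ == "__main__":
    echo: LLM = lambda system, msgs: "7" if "Rate" in system else "Hello!"
    _, score = run_episode(echo, echo, judges=[echo, echo],
                           character="Sherlock Holmes")
    print(score)  # -> 7.0
```

No human is in this loop: user simulation, in-character play, and scoring are all delegated to models, which is what makes the evaluation dynamic and scalable.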
📝 Abstract
We introduce a benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model that assumes a specific character role, an interrogator model that simulates user behavior, and several judge models that evaluate conversation quality. To validate our approach, we conducted experiments comparing automated evaluations with human annotations, demonstrating strong correlations across multiple criteria. This work provides a foundation for robust and dynamic evaluation of model capabilities in interactive scenarios.
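The validation step reduces to rank correlation between automated and human scores per criterion. A minimal sketch using `scipy.stats.spearmanr` follows; the score values are made-up placeholders, not data from the paper.

```python
from scipy.stats import spearmanr

# Per-dialogue scores on one criterion (e.g. role consistency).
auto_scores  = [7.5, 6.0, 8.2, 5.1, 9.0]   # judge-model scores (illustrative)
human_scores = [8.0, 5.5, 8.5, 4.8, 9.2]   # human annotations (illustrative)

# Spearman's rho measures agreement in the *ranking* of dialogues,
# so it is insensitive to differences in scoring scale between judges.
rho, p_value = spearmanr(auto_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A rho above 0.85, as reported for the key dimensions, indicates that the judge models rank dialogues nearly the same way human annotators do.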