PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?

📅 2025-08-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current LLM-based evaluation of role-playing systems relies on unvalidated large language models (LLMs) as judges, implicitly assuming that accurate speaker attribution (role identification) is a capability those judges already possess; this assumption has remained empirically unexamined. Method: We introduce PersonaEval, the first benchmark explicitly designed to assess role-identification ability, constructed from authentic human dialogues (novels, screenplays, and video transcripts) and evaluated through controlled experiments and human baseline studies. Contribution/Results: State-of-the-art LLMs achieve only about 69% accuracy on role attribution, well below the human baseline of 90.8%, exposing a fundamental limitation in role reasoning. This work establishes role identification as a necessary precondition for reliable role-playing evaluation and provides a reproducible, human-grounded assessment framework, laying the groundwork for more trustworthy LLM-as-judge paradigms.

📝 Abstract
Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification, the ability to recognize who is speaking based on dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first correctly attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to determine the correct persona according to the conversation context. Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy, well below the level needed for reliable evaluation. In contrast, human participants perform near ceiling with 90.8% accuracy, highlighting that current LLM evaluators are still not human enough to effectively judge role-play scenarios. To better understand this gap, we examine training-time adaptation and test-time compute, suggesting that reliable evaluation requires more than task-specific tuning, but depends on strong, human-like reasoning abilities in LLM evaluators. We release our benchmark at https://github.com/maple-zhou/PersonaEval.
Problem

Research questions and friction points this paper is trying to address.

Assessing whether LLM evaluators can reliably identify who is speaking in role-play dialogues
Grounding judgments of role-playing quality in correct attribution of dialogue to personas
Comparing LLM and human accuracy at judging role-play scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

PersonaEval benchmark tests LLM role identification
Uses human-authored dialogues for evaluation
Highlights LLM-human gap in reasoning abilities
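As the abstract describes it, the benchmark's core task reduces to choosing the correct persona for an utterance given the conversation context, with accuracy scored against ground truth. A minimal sketch of such an evaluation loop is below; the item format and the `predict` heuristic are illustrative assumptions, not PersonaEval's actual API or judging model.

```python
# Hypothetical sketch of a role-identification evaluation loop.
# The Item schema and predict() are assumptions for illustration,
# not the PersonaEval benchmark's real data format or judge.

from dataclasses import dataclass

@dataclass
class Item:
    context: str           # dialogue excerpt with the target speaker masked
    candidates: list[str]  # candidate personas to choose from
    answer: str            # ground-truth speaker

def predict(item: Item) -> str:
    """Stand-in for an LLM judge: a trivial heuristic that picks the
    candidate whose name appears most often in the context."""
    return max(item.candidates, key=lambda c: item.context.count(c))

def accuracy(items: list[Item]) -> float:
    """Fraction of items where the predicted persona matches ground truth."""
    correct = sum(predict(it) == it.answer for it in items)
    return correct / len(items)

# Two toy items (invented dialogue, not benchmark data).
items = [
    Item("Watson: The game is afoot, surely? [MASKED]: Elementary. Holmes smiled.",
         ["Holmes", "Watson"], "Holmes"),
    Item("[MASKED]: I must note this down. Watson reached for his journal.",
         ["Holmes", "Watson"], "Watson"),
]
print(f"role-identification accuracy: {accuracy(items):.2f}")
```

In the real setting, `predict` would prompt the LLM evaluator with the dialogue and candidate personas; the reported 69% vs. 90.8% figures are exactly this accuracy computed for models and for human participants.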