🤖 AI Summary
Current LLM-based evaluation of role-playing systems relies on unvalidated large language models (LLMs) as judges, implicitly assuming accurate speaker attribution (i.e., role identification) as a foundational capability; yet this assumption has remained empirically unexamined. Method: We introduce PersonaEval, the first benchmark explicitly designed to assess role identification, constructed from authentic human-authored dialogues (novels, screenplays, video transcripts) and evaluated via controlled experiments and a human baseline study. Contribution/Results: State-of-the-art LLMs achieve only about 69% accuracy on role attribution, well below the human baseline of 90.8%, exposing a fundamental limitation in role reasoning. This work establishes role identification as a necessary precondition for reliable role-playing evaluation and provides a reproducible, human-grounded assessment framework, laying the groundwork for more trustworthy LLM-as-judge paradigms.
📝 Abstract
Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification, the ability to recognize who is speaking based on dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to identify the correct persona from the conversation context. Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy, well below the level needed for reliable evaluation. In contrast, human participants perform near ceiling with 90.8% accuracy, highlighting that current LLM evaluators are still not human enough to effectively judge role-play scenarios. To better understand this gap, we examine training-time adaptation and test-time compute, suggesting that reliable evaluation requires not just task-specific tuning but strong, human-like reasoning abilities in LLM evaluators. We release our benchmark at https://github.com/maple-zhou/PersonaEval.
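The task the abstract describes (choosing who is speaking from dialogue context) amounts to multiple-choice speaker attribution scored by accuracy. The sketch below illustrates that evaluation loop; the item schema, the toy dialogue snippets, and the `naive_judge` stand-in are hypothetical placeholders, not PersonaEval's actual data format or judging protocol:

```python
# Minimal sketch of a role-identification accuracy evaluation,
# assuming a hypothetical item schema (not PersonaEval's real one).
from dataclasses import dataclass


@dataclass
class RoleIDItem:
    context: str           # dialogue excerpt with the last speaker hidden
    candidates: list[str]  # candidate persona names
    answer: str            # gold speaker label


def accuracy(items: list[RoleIDItem], judge) -> float:
    """Fraction of items where the judge picks the gold speaker."""
    correct = sum(judge(it.context, it.candidates) == it.answer for it in items)
    return correct / len(items)


# Toy stand-in for an LLM judge: always picks the first candidate.
# A real evaluator would prompt a model with the context and candidates.
def naive_judge(context: str, candidates: list[str]) -> str:
    return candidates[0]


items = [
    RoleIDItem("'Elementary, my dear fellow.'", ["Holmes", "Watson"], "Holmes"),
    RoleIDItem("'I say, that is remarkable!'", ["Holmes", "Watson"], "Watson"),
]
print(accuracy(items, naive_judge))  # → 0.5 on this toy set
```

Reported numbers like the 69% LLM accuracy versus the 90.8% human baseline would come from running such a loop over the full benchmark with real model and human judges.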