RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

📅 2026-04-13

🤖 AI Summary
Standard NLP evaluation metrics do not capture how well role-playing large language models (LLMs) maintain role adherence, logical consistency, and long-term narrative stability. To address this gap, this work proposes RPA-Check, a four-stage automated evaluation framework that defines evaluation dimensions, expands them into Boolean checklist indicators, filters those indicators semantically for objectivity, non-redundancy, and agent isolation, and scores agents with a chain-of-thought-enhanced LLM-as-a-Judge. The framework provides a standardized, reproducible metric for evaluating role-playing agents in specialized domains. Experiments on LLM Court, a serious game for forensic training, across five legal scenarios indicate that adequately instruction-tuned small models (8–9B parameters) can outperform larger models in procedural consistency. The authors attribute this to larger models being prone to user-alignment bias or sycophancy, and the result suggests an inverse relationship between parametric scale and procedural consistency in this setting.

📝 Abstract
The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraint-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, establishing high-level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular Boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, non-redundancy, and agent isolation; and (4) LLM-as-a-Judge Evaluation, which employs chain-of-thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training involving several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework's ability to identify subtle trade-offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction-tuned models (8-9B) can outperform larger architectures prone to user-alignment bias or sycophancy. RPA-Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.
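The four-step pipeline in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Indicator` type, the function names, the example questions, and the 0.8 threshold are all hypothetical. Token-overlap (Jaccard) similarity stands in for whatever semantic filtering the paper actually uses, and the LLM-as-a-Judge step is replaced by a hand-written dictionary of Boolean verdicts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    """One Boolean checklist item (hypothetical structure, not from the paper)."""
    dimension: str  # step 1: high-level behavioral dimension
    agent: str      # the role this indicator applies to
    question: str   # step 2: granular yes/no question


def normalize(text: str) -> frozenset[str]:
    # Lowercase and drop the question mark so trivial variants compare equal.
    return frozenset(text.lower().replace("?", "").split())


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def filter_indicators(indicators, target_agent, threshold=0.8):
    """Step 3 stand-in: agent isolation plus near-duplicate removal.

    Assumption: token overlap approximates semantic similarity here.
    """
    kept = []
    for ind in indicators:
        if ind.agent != target_agent:
            continue  # agent isolation: skip indicators about other roles
        tokens = normalize(ind.question)
        if any(jaccard(tokens, normalize(k.question)) >= threshold for k in kept):
            continue  # redundant with an indicator already kept
        kept.append(ind)
    return kept


def score(indicators, verdicts):
    """Step 4 stand-in: fraction of passed indicators per dimension.

    `verdicts` maps each question to the judge's True/False answer.
    """
    counts = {}
    for ind in indicators:
        passed, total = counts.get(ind.dimension, (0, 0))
        counts[ind.dimension] = (passed + int(verdicts[ind.question]), total + 1)
    return {dim: passed / total for dim, (passed, total) in counts.items()}


# Hypothetical indicators for a courtroom scenario.
indicators = [
    Indicator("role adherence", "judge", "Does the judge address the court formally?"),
    Indicator("role adherence", "judge", "Does the judge address the court formally"),
    Indicator("role adherence", "witness", "Does the witness answer only what is asked?"),
    Indicator("logical consistency", "judge", "Does the judge rule on every objection?"),
]

kept = filter_indicators(indicators, target_agent="judge")

# Stubbed judge output; a real system would obtain these from an LLM judge.
verdicts = {
    "Does the judge address the court formally?": True,
    "Does the judge rule on every objection?": False,
}
scores = score(kept, verdicts)
```

In this sketch the duplicate question and the witness indicator are filtered out, leaving two indicators for the judge role. Reporting a pass rate per dimension keeps each Boolean verdict traceable to a single checklist item, which is the property the checklist-based design relies on.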
Problem

Research questions and friction points this paper is trying to address.

Role-Playing Agents
Large Language Models
Automated Evaluation
Narrative Stability
Agent Assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Role-Playing Agents
Automated Evaluation
LLM-as-a-Judge
Chain-of-Thought Verification
Behavioral Consistency
Riccardo Rosati
Department of Political Sciences, Communication and International Relations, University of Macerata, Via Don Minzoni 22/A, Macerata, 62100, Italy
Edoardo Colucci
Department of Information Engineering, Università Politecnica delle Marche, Via Brecce Bianche 12, Ancona, 60131, Italy
Massimiliano Bolognini
Department of Information Engineering, Università Politecnica delle Marche, Via Brecce Bianche 12, Ancona, 60131, Italy
Adriano Mancini
Department of Information Engineering, Università Politecnica delle Marche
Remote Sensing, GIS, Unmanned Systems
Paolo Sernani
Department of Law, University of Macerata, Piaggia dell’Università 2, Macerata, 62100, Italy