🤖 AI Summary
Standard NLP metrics fail to capture how well role-playing large language models (LLMs) maintain role adherence, logical consistency, and long-term narrative stability. To address this gap, this work proposes RPA-Check, a four-stage automated evaluation framework that defines evaluation dimensions, expands them into Boolean checklist indicators, filters the indicators for objectivity, non-redundancy, and agent isolation, and scores agents with a chain-of-thought-enhanced LLM-as-a-Judge mechanism. The framework provides a standardized, reproducible metric for evaluating role-playing agents in specialized domains. Experiments on LLM Court, a serious game for forensic training, show that small instruction-tuned models (8–9B parameters) can outperform larger counterparts in procedural consistency. This suggests an inverse relationship between model scale and procedural consistency in this setting and challenges the prevailing “bigger is better” assumption in LLM development.
📝 Abstract
The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraint-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, which establishes high-level qualitative behavioral criteria; (2) Augmentation, which expands these criteria into granular Boolean checklist indicators; (3) Semantic Filtering, which ensures indicator objectivity, non-redundancy, and agent isolation; and (4) LLM-as-a-Judge Evaluation, which employs chain-of-thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training, using several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework's ability to identify subtle trade-offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction-tuned models (8–9B) can outperform larger architectures prone to user-alignment bias or sycophancy. RPA-Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.
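The four-step pipeline can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Indicator` class, the function names, and the example questions are hypothetical, exact-text deduplication stands in for the paper's semantic filtering, and the `judge` callable stands in for the chain-of-thought LLM-as-a-Judge call. Stages 1 and 2 (dimension definition and augmentation) are represented here simply by a hand-written list of Boolean indicators.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Indicator:
    """One Boolean checklist item (hypothetical structure, not from the paper)."""
    dimension: str  # high-level behavioral dimension (stage 1)
    agent: str      # the single agent role this check targets
    question: str   # yes/no question produced by augmentation (stage 2)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different duplicates match."""
    return " ".join(text.lower().split())


def filter_indicators(indicators: List[Indicator], agent: str) -> List[Indicator]:
    """Stage 3 stand-in: keep one agent's checks and drop duplicate questions.

    Assumption: the paper filters semantically; this sketch only compares
    normalized text, which is a much weaker notion of redundancy.
    """
    seen = set()
    kept = []
    for ind in indicators:
        if ind.agent != agent:
            continue  # agent isolation: ignore checks aimed at other roles
        key = normalize(ind.question)
        if key in seen:
            continue  # redundancy: an equivalent question was already kept
        seen.add(key)
        kept.append(ind)
    return kept


def score(
    transcript: str,
    indicators: List[Indicator],
    judge: Callable[[str, Indicator], bool],
) -> Dict[str, float]:
    """Stage 4: ask the judge each Boolean question; return pass rate per dimension."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for ind in indicators:
        total[ind.dimension] = total.get(ind.dimension, 0) + 1
        if judge(transcript, ind):
            passed[ind.dimension] = passed.get(ind.dimension, 0) + 1
    return {dim: passed.get(dim, 0) / n for dim, n in total.items()}


if __name__ == "__main__":
    checklist = [
        Indicator("role adherence", "judge", "Does the agent stay in role?"),
        Indicator("role adherence", "judge", "does the agent  stay in role?"),
        Indicator("procedure", "judge", "Does the agent follow court procedure?"),
        Indicator("procedure", "lawyer", "Does the agent cite evidence?"),
    ]
    kept = filter_indicators(checklist, agent="judge")

    # Toy judge: a real one would prompt an LLM and parse its yes/no verdict.
    def toy_judge(transcript: str, ind: Indicator) -> bool:
        return ind.dimension == "procedure"

    print(score("(transcript text)", kept, toy_judge))
```

Passing the judge in as a callable keeps the scoring logic independent of any particular model or prompt, so the same checklist could be reused across the models being compared.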