🤖 AI Summary
This work addresses a critical limitation of large language models (LLMs): maintaining long-horizon narrative consistency and rule fidelity within text-based role-playing games (RPGs). To this end, we introduce the first benchmark explicitly designed to evaluate LLM capabilities as text RPG engines, covering two core tasks: game creation and dynamic simulation. We propose a structured event-state representation formalism and a dual-track evaluation paradigm that pairs automated, objective checks of rule adherence with “LLM-as-a-judge” assessment of subjective narrative quality. Experimental results reveal that state-of-the-art LLMs, while generating highly engaging narratives, consistently suffer from state drift and rule violations in complex, long-duration RPG scenarios. Our benchmark provides the first reproducible, verifiable, and controllable evaluation standard for interactive narrative generation, empirically exposing a fundamental bottleneck in LLMs’ long-range logical coherence and temporal reasoning.
📝 Abstract
We present RPGBench, the first benchmark designed to evaluate large language models (LLMs) as text-based role-playing game (RPG) engines. RPGBench comprises two core tasks: Game Creation (GC) and Game Simulation (GS). In GC, an LLM must craft a valid and playable RPG world using a structured event-state representation, ensuring logical coherence and proper termination conditions. In GS, the LLM simulates interactive gameplay across multiple rounds while consistently updating states and enforcing game rules. To comprehensively assess performance, RPGBench integrates objective and subjective evaluation methodologies. Objective measures verify adherence to event mechanics and check variable updates without requiring human intervention. Subjective measures, such as content interestingness, action quality, and role-playing capability, are evaluated via an LLM-as-a-judge framework, where a strong LLM grades each candidate's outputs. Empirical results demonstrate that state-of-the-art LLMs can produce engaging stories but often struggle to implement consistent, verifiable game mechanics, particularly in long or complex scenarios. By combining structured, rule-based assessments with LLM-based judgments, RPGBench provides a new standard for evaluating how well LLMs can balance creativity, coherence, and complexity in text-based RPGs, opening avenues for more immersive and controllable interactive storytelling.
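To make the objective evaluation track concrete, here is a minimal sketch of what an event-state representation and an automated rule check might look like. This is an illustrative assumption, not RPGBench's actual schema: the `GameState`, `Event`, and `verify_transition` names, and the precondition/effect structure, are hypothetical simplifications of the "event mechanics" and "variable updates" the abstract says are verified without human intervention.

```python
from dataclasses import dataclass

@dataclass
class GameState:
    # Hypothetical flat variable store, e.g. {"hp": 10, "gold": 5}
    variables: dict

@dataclass
class Event:
    name: str
    preconditions: dict  # variable -> minimum value required to fire
    effects: dict        # variable -> delta applied on firing

def apply_event(state: GameState, event: Event) -> GameState:
    """Fire an event against a state, enforcing its preconditions."""
    for var, minimum in event.preconditions.items():
        if state.variables.get(var, 0) < minimum:
            raise ValueError(f"rule violation: {var} below {minimum}")
    for var, delta in event.effects.items():
        state.variables[var] = state.variables.get(var, 0) + delta
    return state

def verify_transition(before: GameState, after: GameState, event: Event) -> bool:
    """Objective check: did the simulated turn update every variable
    exactly as the event's rules dictate?"""
    expected = apply_event(GameState(dict(before.variables)), event)
    return expected.variables == after.variables
```

A verifier in this style can mechanically flag state drift: if the model narrates buying a potion but forgets to deduct gold, `verify_transition` returns `False`, with no human judge needed.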