🤖 AI Summary
This work addresses the challenge that large language models struggle to consistently access character-specific knowledge during extended open-ended dialogues and lack autonomous, memory-driven role-playing capabilities. Inspired by Stanislavski's theory of "emotional memory," the authors propose a memory-driven role-playing paradigm that models character knowledge as the model's internal memory and relies solely on dialogue context for retrieval and generation. They introduce MREval, a fine-grained evaluation framework; MRPrompt, a prompting architecture for structured memory retrieval; and MRBench, a bilingual benchmark, along with a four-stage capability assessment protocol. Experimental results demonstrate that this approach significantly enhances character consistency, enabling smaller models such as Qwen3-8B equipped with MRPrompt to match the performance of proprietary large models like Qwen3-Max and GLM-4.7, thereby validating the critical role of memory mechanisms in improving response quality.
📝 Abstract
A core challenge for faithful LLM role-playing is sustaining consistent characterization throughout long, open-ended dialogues, as models frequently fail to recall and accurately apply their designated persona knowledge without explicit cues. To tackle this, we propose the Memory-Driven Role-Playing paradigm. Inspired by Stanislavski's "emotional memory" acting theory, this paradigm frames persona knowledge as the LLM's internal memory store, requiring retrieval and application based solely on dialogue context, thereby providing a rigorous test of the depth and autonomous use of that knowledge. Centered on this paradigm, we contribute: (1) MREval, a fine-grained evaluation framework assessing four memory-driven abilities - Anchoring, Recalling, Bounding, and Enacting; (2) MRPrompt, a prompting architecture that guides structured memory retrieval and response generation; and (3) MRBench, a bilingual (Chinese/English) benchmark for fine-grained diagnosis. Together, these components provide a comprehensive diagnostic of the four-stage role-playing abilities across 12 LLMs. Crucially, experiments show that MRPrompt allows small models (e.g., Qwen3-8B) to match the performance of much larger closed-source LLMs (e.g., Qwen3-Max and GLM-4.7), and confirm that upstream memory gains directly enhance downstream response quality, validating the staged theoretical foundation.