🤖 AI Summary
This work addresses the challenge that current large language models struggle to simulate the internal cognitive processes of characters in role-playing scenarios, primarily due to the absence of high-quality reasoning trajectories and of reward signals aligned with human preferences. To overcome this limitation, the authors propose the HER framework, which introduces a dual-layer thinking mechanism that explicitly distinguishes characters' first-person thinking from the model's third-person reasoning. By combining reasoning-augmented data constructed via reverse engineering with a reward model trained to align with human preferences, the framework trains Qwen3-32B through both supervised and reinforcement learning. Evaluated on the CoSER benchmark and the Minimax Role-Play Bench, the approach achieves gains of 30.26 and 14.97 points over the Qwen3-32B baseline, respectively, substantially outperforming existing baselines and advancing role-playing from superficial imitation toward deep cognitive simulation.
📝 Abstract
LLM role-playing, i.e., using LLMs to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a challenge. Towards cognitive simulation in LLM role-play, previous efforts mainly suffer from two deficiencies: the lack of data with high-quality reasoning traces, and the absence of reliable reward signals aligned with human preferences. In this paper, we propose HER, a unified framework for cognitive-level persona simulation. HER introduces dual-layer thinking, which distinguishes characters' first-person thinking from LLMs' third-person thinking. To bridge these gaps, we curate reasoning-augmented role-playing data via reverse engineering and construct human-aligned principles and reward models. Leveraging these resources, we train HER models based on Qwen3-32B via supervised and reinforcement learning. Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26-point improvement on the CoSER benchmark and a 14.97-point gain on the Minimax Role-Play Bench. Our datasets, principles, and models will be released to facilitate future research.