RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following

📅 2025-02-17

📈 Citations: 0

✨ Influential: 0

career value

169K/year

🤖 AI Summary

Existing role-playing datasets overlook identity consistency under instruction-following scenarios. This work introduces RoleMRC, the first fine-grained benchmark jointly evaluating role-playing and instruction following, comprising three task types: multi-turn role-based dialogue, role-conditioned machine reading comprehension, and nested-priority instruction execution. It contains 10.2K role profiles, 37.9K synthetic instructions, and 1.4K test instances. Methodologically, we model the coupling between role-identity constraints and instruction executability, proposing a novel ternary decision mechanism—respond, refuse, or attempt—and enabling capability attribution at the neural activation level. Leveraging synthetic data generation, multidimensional capability assessment, cross-dataset transfer validation, and internal LLM activation probing, our approach improves instruction-following accuracy of mainstream LLMs by +18.7%, while preserving role consistency and enhancing reasoning robustness.

Technology Category

Application Category

📝 Abstract

Role-playing is important for Large Language Models (LLMs) to follow diverse instructions while maintaining role identity and the role's pre-defined ability limits. Existing role-playing datasets mostly contribute to controlling role style and knowledge boundaries, but overlook role-playing in instruction-following scenarios. We introduce a fine-grained role-playing and instruction-following composite benchmark, named RoleMRC, including: (1) Multi-turn dialogues between ideal roles and humans, including free chats or discussions upon given passages; (2) Role-playing machine reading comprehension, involving response, refusal, and attempts according to passage answerability and role ability; (3) More complex scenarios with nested, multi-turn and prioritized instructions. The final RoleMRC features a 10.2k role profile meta-pool, 37.9k well-synthesized role-playing instructions, and 1.4k testing samples. We develop a pipeline to quantitatively evaluate the fine-grained role-playing and instruction-following capabilities of several mainstream LLMs, as well as models that are fine-tuned on our data. Moreover, cross-evaluation on external role-playing datasets confirms that models fine-tuned on RoleMRC enhances instruction-following without compromising general role-playing and reasoning capabilities. We also probe the neural-level activation maps of different capabilities over post-tuned LLMs. Access to our RoleMRC, RoleMRC-mix and Codes: https://github.com/LuJunru/RoleMRC.

Problem

Research questions and friction points this paper is trying to address.

Enhancing role-playing in instruction-following scenarios.

Developing a fine-grained benchmark for role-playing capabilities.

Quantitatively evaluating LLMs on complex role-playing tasks.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained role-playing benchmark

Multi-turn instruction-following evaluation

Neural-level activation analysis

🔎 Similar Papers

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation