FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

📅 2025-10-08
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
Existing role-playing (RP) benchmarks suffer from narrow scenario coverage, outdated interaction paradigms, and limited adaptability, leaving them ill-suited to the rapid evolution of large language models (LLMs). To address this, the paper proposes FURINA-Builder, a multi-agent collaboration pipeline and the first customizable RP benchmark builder, which automatically constructs fine-grained benchmarks for arbitrary characters, scenarios, and prompt formats at any scale. The pipeline draws a test character and dialogue partners from a well-constructed character-scene pool, simulates their dialogue, and uses an LLM judge to select fine-grained evaluation dimensions and adjust the test character's responses into final test utterances; each test case is then scored against dimension-specific criteria. Using the pipeline, the authors build FURINA-Bench from both established and synthesized characters, on which o3 (English) and DeepSeek-R1 (Chinese) achieve the best results. The evaluation further shows that model scale does not monotonically reduce hallucination and that, for reasoning LLMs, stronger reasoning improves RP performance while increasing RP hallucination, a trade-off that extends to a broader Pareto frontier between RP performance and reliability.

📝 Abstract
As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, making it the first benchmark builder in the RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character's responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.
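
The construction loop described in the abstract lends itself to a short sketch: sample a test character and a dialogue partner from the character-scene pool, simulate a dialogue, then have an LLM judge select evaluation dimensions and adjust the test character's reply into the final test utterance. The Python below is only an illustration under those assumptions; the `llm` helper, the `Character`/`Scene` containers, the turn count, and all prompt wording are hypothetical and not taken from the paper.

```python
# Minimal sketch of a FURINA-Builder-style construction loop.
# Assumptions (not from the paper): the generic chat helper `llm()`, the
# `Character`/`Scene` containers, the turn count, and all prompt wording.
from dataclasses import dataclass


@dataclass
class Character:
    name: str
    profile: str          # persona description used to condition the simulator


@dataclass
class Scene:
    setting: str          # scenario text drawn from the character-scene pool
    dimensions: list[str]  # candidate fine-grained evaluation dimensions


def llm(system: str, prompt: str) -> str:
    """Placeholder for any chat-completion backend (hypothetical)."""
    raise NotImplementedError


def build_test_item(test_char: Character, partner: Character,
                    scene: Scene, turns: int = 4) -> dict:
    """Simulate a dialogue, then let an LLM judge pick evaluation dimensions
    and adjust the test character's reply into the final test utterance."""
    history: list[tuple[str, str]] = []
    for t in range(turns):
        speaker = partner if t % 2 == 0 else test_char
        reply = llm(
            system=f"You are {speaker.name}. {speaker.profile} Scene: {scene.setting}",
            prompt="\n".join(f"{n}: {u}" for n, u in history) or "(start the conversation)",
        )
        history.append((speaker.name, reply))

    # Judge step 1: select the evaluation dimensions this dialogue can probe.
    chosen = llm(
        system="You are an evaluation judge for role-playing benchmarks.",
        prompt=f"Dialogue:\n{history}\nPick the most diagnostic of {scene.dimensions}.",
    )
    # Judge step 2: adjust the test character's last reply into the final test utterance.
    adjusted = llm(
        system="Rewrite the reply so it cleanly targets the chosen dimensions.",
        prompt=f"Dimensions: {chosen}\nReply: {history[-1][1]}",
    )
    return {"dialogue": history, "dimensions": chosen, "test_utterance": adjusted}
```

In the actual pipeline these steps are carried out by collaborating agents, and the resulting test items are scored against dimension-specific criteria, as the abstract describes.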
Problem

Research questions and friction points this paper is trying to address.

Addresses outdated role-playing benchmarks with narrow scope
Introduces scalable multi-agent pipeline for customizable evaluation
Investigates trade-off between role-playing performance and hallucinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent pipeline for customizable role-playing benchmarks
Automated character-scene pool generation for diverse scenarios
LLM judge selects evaluation dimensions and adjusts responses into final test utterances
Haotian Wu
The Hong Kong University of Science and Technology (Guangzhou)
Shufan Jiang
East China University of Science and Technology
Large Language Models, Multi-Agent Systems, Scaling Environment for Agents, World Models
Chios Chen
National University of Singapore
Yiyang Feng
Stony Brook University
Hehai Lin
The Hong Kong University of Science and Technology (Guangzhou)
NLP/LLM/LVLM Reasoning, Multi-agent system (MAS)
Heqing Zou
NTU
deep learning
Yao Shu
The Hong Kong University of Science and Technology (Guangzhou)
Yanran Li
Independent Researcher
Chengwei Qin
HKUST(GZ), NTU
LLM, NLP