TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

📅 2025-10-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluations of LLM-based persona simulation suffer from three key limitations: overreliance on synthetic data, the absence of a systematic evaluation framework, and a failure to decouple underlying capabilities. To address these, we introduce TwinVoice—the first real-world-oriented, multidimensional benchmark for persona simulation—modeling persona along social, interpersonal, and narrative dimensions. TwinVoice systematically evaluates six decoupled core capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Leveraging large-scale realistic dialogue data and human-reference experiments, we identify critical weaknesses in state-of-the-art models—particularly in syntactic style control and long-term memory maintenance—and demonstrate substantial performance gaps relative to the human baseline. This work establishes a reproducible, capability-decomposable, and human-aligned evaluation paradigm for LLM persona modeling.

📝 Abstract
Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the underlying capability requirements. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short in capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.
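The six-way capability decomposition described above implies per-capability scoring that is aggregated into an overall figure. The sketch below is a hypothetical illustration of such an aggregation, not the paper's actual evaluation pipeline; the capability keys follow the abstract, while the function name and the example numbers are assumptions:

```python
from statistics import mean

# The six capabilities named in the abstract.
CAPABILITIES = [
    "opinion_consistency", "memory_recall", "logical_reasoning",
    "lexical_fidelity", "persona_tone", "syntactic_style",
]

def aggregate_scores(per_item_scores):
    """Average per-item scores (0-1) within each capability, then overall.

    per_item_scores: dict mapping each capability name to a list of
    per-test-item scores. Returns a dict of capability means plus an
    unweighted "overall" mean across capabilities.
    """
    report = {cap: mean(per_item_scores[cap]) for cap in CAPABILITIES}
    report["overall"] = mean(report[cap] for cap in CAPABILITIES)
    return report

# Illustrative numbers only; not results reported in the paper.
scores = {
    "opinion_consistency": [0.8, 0.7],
    "memory_recall":       [0.4, 0.5],
    "logical_reasoning":   [0.7, 0.6],
    "lexical_fidelity":    [0.6, 0.6],
    "persona_tone":        [0.5, 0.7],
    "syntactic_style":     [0.3, 0.4],
}
report = aggregate_scores(scores)
```

Keeping the overall score as an unweighted mean over capabilities (rather than over items) prevents a capability with many test items from dominating the headline number.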
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM persona simulation lacks systematic frameworks
Assessing multi-dimensional digital twin capabilities in real contexts
Identifying performance gaps in syntactic style and memory recall
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-dimensional benchmark for persona simulation
Decomposes evaluation into six fundamental capabilities
Assesses models across social and narrative contexts
👥 Authors

Bangde Du, Tsinghua University
Minghao Guo, Rutgers University
Songming He, Fudan University
Ziyi Ye, Fudan University
Xi Zhu, Rutgers University
Weihang Su, Tsinghua University (Information Retrieval, Natural Language Processing, AI for Legal)
Shuqi Zhu, Tsinghua University
Yujia Zhou, Tsinghua University
Yongfeng Zhang, Rutgers University
Qingyao Ai, Associate Professor, Dept. of CS&T, Tsinghua University (Information Retrieval, Machine Learning)
Yiqun Liu, Tsinghua University