🤖 AI Summary
Current audio-based role-play evaluation relies on audio large language models (ALLMs) as zero-shot judges, which suffer from three key limitations: neglect of paralinguistic cues, coarse-grained scoring, and dependence on synthetic speech references that fail to reflect real-world roles. To address these, the authors propose Speech-DRAME, a human-aligned evaluation framework built on two complementary strategies: Archetype Evaluation (top-down, measuring adherence to broad role archetypes) and Realism Evaluation (bottom-up, grounded in real human speech). The framework contributes a bilingual, human-annotated speech evaluation benchmark (Speech-DRAME-EvalBench), a fine-tuned ALLM judge for multi-dimensional, fine-grained scoring (DRAME-Eval), and a speech role-play benchmark for comparing speech foundation models (Speech-DRAME-RoleBench). Experiments show substantially stronger agreement with human ratings: Pearson's *r* = 0.629 (up from 0.480, +0.149) for Archetype Evaluation and 0.625 (up from 0.390, +0.235) for Realism Evaluation, outperforming zero-shot and few-shot ALLM baselines.
📝 Abstract
Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that leverages DRAME-Eval as an automatic judge to compare speech foundation models (SFMs). Speech-DRAME distinguishes between two complementary evaluation strategies: Archetype Evaluation, a top-down approach measuring adherence to broad role archetypes, and Realism Evaluation, a bottom-up approach grounded in real human speech that emphasizes nuanced role quality. Compared to zero-shot ALLM judges, DRAME-Eval achieves stronger agreement with human ratings (Pearson correlation from 0.480 to 0.629 for Archetype Evaluation, and from 0.390 to 0.625 for Realism Evaluation). By integrating transparent benchmark resources, modeling approaches, and system-level evaluation, Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play.
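The headline numbers above are Pearson correlations between the evaluator's scores and human ratings. As a minimal sketch of how such agreement could be computed (the score lists below are hypothetical, not from the paper; the paper itself does not prescribe this implementation):

```python
def pearson_r(xs, ys):
    """Pearson correlation between two paired score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Covariance numerator and the two standard-deviation terms
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical 1-5 ratings for six role-play clips:
human_scores = [4, 3, 5, 2, 4, 1]  # human annotators
model_scores = [4, 3, 4, 2, 5, 2]  # automatic judge
print(round(pearson_r(human_scores, model_scores), 3))
```

A value near 1.0 indicates the automatic judge ranks and scales clips much like human annotators do; the paper reports *r* ≈ 0.63 for its fine-tuned judge versus ≈ 0.4–0.5 for zero-shot ALLMs.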