🤖 AI Summary
Existing benchmarks for Full-Duplex Speech Language Models (FD-SLMs) are limited to single-turn dialogues, overlooking the complexity of multi-turn interaction and lacking systematic assessment of instruction following and safety. To address the challenges of full-duplex multi-turn dialogue, such as ambiguous turn boundaries and contextual inconsistency, the paper introduces MTR-DuplexBench, the first benchmark to enable fine-grained multi-turn evaluation. The benchmark discretizes continuous speech interaction via dialogue segmentation and establishes a four-dimensional, human-in-the-loop scoring framework covering dialogue quality, conversational dynamics, instruction following, and safety. Experimental results reveal significant deficiencies in current state-of-the-art FD-SLMs regarding multi-turn consistency and cross-dimensional coordination, validating both the necessity and effectiveness of MTR-DuplexBench. The benchmark provides quantifiable metrics to guide iterative model improvement, advancing rigorous, holistic evaluation of full-duplex conversational AI systems.
📝 Abstract
Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience than traditional half-duplex models. However, existing benchmarks primarily evaluate single-round interactions and conversational features, neglecting the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark that segments continuous full-duplex dialogues into discrete turns, enabling comprehensive, turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety. Experimental results reveal that current FD-SLMs struggle to maintain consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our proposed benchmark. The benchmark and code will be made publicly available.
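The abstract does not spell out how continuous dialogues are cut into discrete turns, so the following is only a minimal Python sketch of the general idea, assuming per-speaker voice-activity segments on a shared timeline and a simple silence-gap heuristic. The `Segment` type, the `gap_threshold` parameter, and the heuristic itself are illustrative assumptions, not the benchmark's actual procedure.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # "user" or "model"
    start: float   # seconds on the shared dialogue timeline
    end: float

def segment_turns(segments, gap_threshold=1.0):
    """Group a full-duplex timeline into discrete turns.

    Illustrative heuristic only: a new turn opens whenever the user
    starts speaking after a silence gap longer than `gap_threshold`;
    overlapping model speech (e.g. backchannels or barge-in replies)
    stays attached to the current turn.
    """
    events = sorted(segments, key=lambda s: s.start)
    turns, current, last_end = [], [], None
    for seg in events:
        opens_new_turn = (
            seg.speaker == "user"
            and last_end is not None
            and seg.start - last_end > gap_threshold
        )
        if opens_new_turn and current:
            turns.append(current)
            current = []
        current.append(seg)
        last_end = seg.end if last_end is None else max(last_end, seg.end)
    if current:
        turns.append(current)
    return turns

# Example: a short dialogue where the model's reply overlaps the user.
timeline = [
    Segment("user", 0.0, 2.1),
    Segment("model", 1.8, 4.0),   # overlaps the user's first utterance
    Segment("user", 5.5, 7.0),    # 1.5 s silence gap -> new turn
    Segment("model", 7.2, 9.0),
]
for i, turn in enumerate(segment_turns(timeline), 1):
    print(f"Turn {i}: " + ", ".join(f"{s.speaker}[{s.start}-{s.end}]" for s in turn))
```

Once the stream is discretized this way, each turn can be scored independently along the four dimensions the paper names, which is what makes turn-by-turn evaluation of an otherwise continuous, overlapping interaction tractable.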