FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenges of model interruption capability, response latency, and dialogue-quality evaluation in full-duplex voice interaction, including urgent scenarios. To this end, we introduce FLEXI, the first dedicated benchmark covering six realistic human-machine interaction settings. Methodologically, we propose the first emergency interruption mechanism, integrating real-time streaming speech processing, end-to-end latency measurement, and dialogue-validity scoring; we further introduce a next token-pair prediction paradigm to model natural turn-taking. Key contributions include: (1) a systematic analysis revealing substantial performance gaps between open-source and commercial large language models in emergency awareness, turn termination, and latency control; (2) a multi-dimensional evaluation framework for full-duplex voice interaction; and (3) empirical evidence of critical bottlenecks in existing models around abrupt interruption handling and low-latency response, while the token-pair prediction paradigm significantly improves interaction fluency.
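The summary mentions end-to-end latency measurement for real-time streaming dialogue. A minimal sketch of that idea, measuring the gap between the end of a user turn and the start of the model's spoken reply, might look like the following (the event format and function name are hypothetical illustrations, not the paper's actual tooling):

```python
def measure_response_latency(events):
    """Compute end-to-end response latencies for a full-duplex exchange.

    `events` is a hypothetical chronologically ordered list of
    (timestamp_seconds, kind) tuples, where kind is either
    "user_speech_end" or "model_speech_start". Each latency is the gap
    between a user finishing a turn and the model starting its reply.
    """
    latencies = []
    pending_end = None  # timestamp of the most recent unanswered user turn
    for ts, kind in events:
        if kind == "user_speech_end":
            pending_end = ts
        elif kind == "model_speech_start" and pending_end is not None:
            latencies.append(round(ts - pending_end, 3))
            pending_end = None
    return latencies

# Example: two turns, with 0.4 s and 0.9 s response gaps.
log = [
    (0.0, "user_speech_end"),
    (0.4, "model_speech_start"),
    (3.0, "user_speech_end"),
    (3.9, "model_speech_start"),
]
print(measure_response_latency(log))  # [0.4, 0.9]
```

In a real streaming system the timestamps would come from a voice-activity detector and the audio playback clock rather than a static log.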

📝 Abstract
Full-Duplex Speech-to-Speech Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling real-time spoken dialogue systems. However, benchmarking and modeling these models remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex LLM-human spoken interaction that explicitly incorporates model interruption in emergency scenarios. FLEXI systematically evaluates the latency, quality, and conversational effectiveness of real-time dialogue through six diverse human-LLM interaction scenarios, revealing significant gaps between open-source and commercial models in emergency awareness, turn termination, and interaction latency. Finally, we suggest that next token-pair prediction offers a promising path toward achieving truly seamless and human-like full-duplex interaction.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking full-duplex human-LLM spoken interaction systems
Evaluating latency, quality and effectiveness in real-time dialogue
Addressing emergency awareness and interruption capabilities in models
Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark for full-duplex LLM-human spoken interaction
Systematically evaluates latency, quality and conversational effectiveness
Proposes next token-pair prediction for seamless interaction
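The next token-pair prediction idea can be pictured as a decoding loop in which, at every step, the model consumes one token from the user's (listening) channel and simultaneously emits one token on its own (speaking) channel, so listening and speaking proceed in lockstep. The toy sketch below is purely illustrative; the stand-in "model", the `<sil>` silence token, and all function names are hypothetical, not the paper's actual architecture:

```python
SILENCE = "<sil>"  # hypothetical token marking silence on a channel

def toy_model(history):
    """Hypothetical stand-in for an LLM: given the interleaved history of
    (user_token, model_token) pairs, return the next model token. Here it
    simply echoes the previous user token in uppercase, or stays silent."""
    if history and history[-1][0] != SILENCE:
        return history[-1][0].upper()
    return SILENCE

def full_duplex_decode(user_stream):
    """One step per incoming user token: the system reads a user token AND
    emits a model token, forming a token pair. Silence tokens keep both
    channels advancing even when one side has nothing to say."""
    history, model_stream = [], []
    for user_tok in user_stream:
        model_tok = toy_model(history)
        history.append((user_tok, model_tok))  # the predicted token pair
        model_stream.append(model_tok)
    return model_stream

print(full_duplex_decode(["hi", "<sil>", "bye"]))
# ['<sil>', 'HI', '<sil>']
```

The point of the pairing is that turn-taking emerges from the token stream itself: the model can begin speaking, fall silent, or yield the floor at any step, rather than waiting for an explicit end-of-turn signal.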
Yuan Ge
Northeastern University, China
Reasoning, Multimodality, LLMs
Saihan Chen
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Jingqi Xiao
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Xiaoqian Liu
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Tong Xiao
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yan Xiang
Kunming University of Science and Technology
Zhengtao Yu
Kunming University of Science and Technology
Jingbo Zhu
Northeastern University, China
Machine Translation, Language Parsing, Natural Language Processing