🤖 AI Summary
This paper addresses the challenges of model interruption capability, response latency, and dialogue-quality evaluation for full-duplex voice interaction in urgent scenarios. To this end, we introduce FLEXI, the first dedicated benchmark covering six realistic human-machine interaction settings. Methodologically, we propose the first evaluation of emergency interruption, integrating real-time streaming speech processing, end-to-end latency measurement, and dialogue-validity scoring; we further highlight a next token-pair prediction paradigm for modeling natural turn-taking. Key contributions include: (1) a systematic analysis revealing substantial performance gaps between open-source and commercial models in emergency awareness, turn termination, and latency control; (2) a multi-dimensional evaluation framework for full-duplex voice interaction; and (3) empirical evidence that existing models face critical bottlenecks in handling abrupt interruptions and responding with low latency, while next token-pair prediction offers a path toward more fluent interaction.
📝 Abstract
Full-Duplex Speech-to-Speech Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling real-time spoken dialogue systems. However, benchmarking and modeling these systems remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex LLM-human spoken interaction that explicitly incorporates model interruption in emergency scenarios. FLEXI systematically evaluates the latency, quality, and conversational effectiveness of real-time dialogue across six diverse human-LLM interaction scenarios, revealing significant gaps between open-source and commercial models in emergency awareness, turn termination, and interaction latency. Finally, we suggest that next token-pair prediction offers a promising path toward achieving truly seamless and human-like full-duplex interaction.
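To make the next token-pair prediction idea concrete, below is a minimal, purely illustrative sketch (not the paper's implementation): at each step the model jointly predicts a pair of tokens, one for the listening (user) channel and one for the speaking (model) channel, so it can talk while monitoring the user and yield the floor the moment a barge-in is detected. All names (`SILENCE`, `decode_full_duplex`, the toy model) are hypothetical.

```python
# Illustrative sketch of next token-pair decoding for full-duplex dialogue.
# A special SILENCE token means "nothing on this channel at this step".
SILENCE = 0

def decode_full_duplex(model_step, user_stream, max_steps=100):
    """Run a toy full-duplex decoding loop.

    model_step(history) -> (listen_token, speak_token): the model's joint
    prediction for the next token pair. The listen-channel prediction is what
    the model expects to hear next (used during training; unused here), while
    the speak-channel token is what the model actually says.
    """
    history = []  # interleaved (user token, model speak token) history
    spoken = []
    for _ in range(max_steps):
        user_tok = next(user_stream, SILENCE)  # incoming user audio token
        _listen_tok, speak_tok = model_step(history + [user_tok])
        history += [user_tok, speak_tok]
        if speak_tok != SILENCE:
            spoken.append(speak_tok)
        # Emergency interruption: the user barged in while the model was
        # speaking, so the model terminates its turn immediately.
        if user_tok != SILENCE and speak_tok != SILENCE:
            break
    return spoken

def toy_model(history):
    """Dummy model: speaks tokens 1, 2, 3, ...; 'expects' the last user token."""
    step = len(history) // 2  # two tokens are appended to history per step
    return (history[-1], step + 1)
```

For example, with a user stream that stays silent for two steps and then barges in (`iter([0, 0, 5, 0])`), the toy model speaks `[1, 2, 3]` and then stops, illustrating the turn-termination behavior FLEXI measures.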