🤖 AI Summary
This paper addresses the challenges of model interruption capability, response latency, and dialogue-quality evaluation for full-duplex voice interaction in urgent scenarios. To this end, we introduce FLEXI, the first dedicated benchmark covering six realistic human-machine interaction settings. Methodologically, we propose the first evaluation of emergency interruption, integrating real-time streaming speech processing, end-to-end latency measurement, and dialogue-validity scoring; we further highlight a next token-pair prediction paradigm for modeling natural turn-taking. Key contributions include: (1) a systematic analysis revealing substantial performance gaps between open-source and commercial models in emergency awareness, turn termination, and latency control; (2) a multi-dimensional evaluation framework for full-duplex voice interaction; and (3) empirical evidence that existing models face critical bottlenecks in handling abrupt interruptions and responding with low latency, while next token-pair prediction offers a path toward more fluent interaction.
📝 Abstract
Full-Duplex Speech-to-Speech Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling real-time spoken dialogue systems. However, benchmarking and modeling these systems remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex LLM-human spoken interaction that explicitly incorporates model interruption in emergency scenarios. FLEXI systematically evaluates the latency, quality, and conversational effectiveness of real-time dialogue across six diverse human-LLM interaction scenarios, revealing significant gaps between open-source and commercial models in emergency awareness, turn termination, and interaction latency. Finally, we suggest that next token-pair prediction offers a promising path toward achieving truly seamless and human-like full-duplex interaction.
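To make the next token-pair prediction idea concrete, below is a minimal, purely illustrative sketch (not the paper's implementation): at each step the model jointly predicts a pair of tokens, one for the listening (user) channel and one for the speaking (model) channel, so it can talk while monitoring the user and yield the floor the moment a barge-in is detected. All names (`SILENCE`, `decode_full_duplex`, the toy model) are hypothetical.

```python
# Illustrative sketch of next token-pair decoding for full-duplex dialogue.
# A special SILENCE token means "nothing on this channel at this step".
SILENCE = 0

def decode_full_duplex(model_step, user_stream, max_steps=100):
    """Run a toy full-duplex decoding loop.

    model_step(history) -> (listen_token, speak_token): the model's joint
    prediction for the next token pair. The listen-channel prediction is what
    the model expects to hear next (used during training; unused here), while
    the speak-channel token is what the model actually says.
    """
    history = []  # interleaved (user token, model speak token) history
    spoken = []
    for _ in range(max_steps):
        user_tok = next(user_stream, SILENCE)  # incoming user audio token
        _listen_tok, speak_tok = model_step(history + [user_tok])
        history += [user_tok, speak_tok]
        if speak_tok != SILENCE:
            spoken.append(speak_tok)
        # Emergency interruption: the user barged in while the model was
        # speaking, so the model terminates its turn immediately.
        if user_tok != SILENCE and speak_tok != SILENCE:
            break
    return spoken

def toy_model(history):
    """Dummy model: speaks tokens 1, 2, 3, ...; 'expects' the last user token."""
    step = len(history) // 2  # two tokens are appended to history per step
    return (history[-1], step + 1)
```

For example, with a user stream that stays silent for two steps and then barges in (`iter([0, 0, 5, 0])`), the toy model speaks `[1, 2, 3]` and then stops, illustrating the turn-termination behavior FLEXI measures.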