🤖 AI Summary
Existing spoken dialogue systems lack plug-and-play, full-duplex semantic endpoint detection modules, resulting in unnatural interaction and high response latency. To address this, we propose Phoenix-VAD, the first semantic Voice Activity Detection (VAD) model tailored for streaming speech interaction and the first to incorporate Large Language Models (LLMs) into endpoint detection, leveraging their deep semantic understanding to identify semantically complete utterance boundaries. We adopt a sliding-window training strategy to enable low-latency streaming inference, and fully decouple the module from the main dialogue model, facilitating independent optimization and deployment. Experiments demonstrate that our approach significantly outperforms conventional acoustic VADs in both semantically complete and incomplete speech scenarios, achieving a superior trade-off between accuracy and real-time performance, and exhibiting strong practical deployability.
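The paper does not publish implementation details, but the sliding-window idea can be sketched as follows: each incoming transcript chunk is appended to a bounded window, and a semantic-completeness classifier is queried over the window's content to decide whether the user has finished speaking. In this hypothetical sketch, `is_semantically_complete` is a trivial punctuation-based stub standing in for the paper's LLM-based judgment; the class name and parameters are illustrative, not from the paper.

```python
from collections import deque


def is_semantically_complete(text: str) -> bool:
    # Stub for the LLM-based completeness classifier: here we simply
    # treat text ending in terminal punctuation as a complete utterance.
    return text.rstrip().endswith((".", "?", "!"))


class SlidingWindowVAD:
    def __init__(self, window_size: int = 8):
        # Keep only the most recent `window_size` chunks, bounding
        # per-step cost so inference stays low-latency while streaming.
        self.window = deque(maxlen=window_size)

    def push(self, chunk: str) -> bool:
        """Feed one streaming transcript chunk; return True once the
        window's content forms a semantically complete utterance."""
        self.window.append(chunk)
        return is_semantically_complete(" ".join(self.window))


vad = SlidingWindowVAD()
decisions = [vad.push(c) for c in ["what is", "the weather", "in Paris today?"]]
# decisions -> [False, False, True]: the endpoint fires only when the
# accumulated text reads as a finished question.
```

Because the detector consumes text (or speech features) independently of the dialogue model, it can be swapped out or retrained without touching the main system, which is the decoupling the summary emphasizes.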
📝 Abstract
Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix-VAD achieves excellent and competitive performance. Furthermore, this design enables the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.