🤖 AI Summary
Existing spoken dialogue systems lack plug-and-play, full-duplex semantic endpoint detection modules, resulting in unnatural interaction and high response latency. To address this, we propose Phoenix-VAD, the first semantic Voice Activity Detection (VAD) model tailored for streaming speech interaction and the first to incorporate Large Language Models (LLMs) into endpoint detection, leveraging their deep semantic understanding to identify semantically complete utterance boundaries. We adopt a sliding-window training strategy to enable low-latency streaming inference, and fully decouple the module from the main dialogue model, facilitating independent optimization and deployment. Experiments demonstrate that our approach significantly outperforms conventional acoustic VADs in both semantically complete and incomplete speech scenarios, achieving a superior trade-off between accuracy and real-time performance, and exhibiting strong practical deployability.
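The paper does not publish implementation details, but the sliding-window idea can be sketched as follows: each incoming transcript chunk is appended to a bounded window, and a semantic-completeness classifier is queried over the window's content to decide whether the user has finished speaking. In this hypothetical sketch, `is_semantically_complete` is a trivial punctuation-based stub standing in for the paper's LLM-based judgment; the class name and parameters are illustrative, not from the paper.

```python
from collections import deque


def is_semantically_complete(text: str) -> bool:
    # Stub for the LLM-based completeness classifier: here we simply
    # treat text ending in terminal punctuation as a complete utterance.
    return text.rstrip().endswith((".", "?", "!"))


class SlidingWindowVAD:
    def __init__(self, window_size: int = 8):
        # Keep only the most recent `window_size` chunks, bounding
        # per-step cost so inference stays low-latency while streaming.
        self.window = deque(maxlen=window_size)

    def push(self, chunk: str) -> bool:
        """Feed one streaming transcript chunk; return True once the
        window's content forms a semantically complete utterance."""
        self.window.append(chunk)
        return is_semantically_complete(" ".join(self.window))


vad = SlidingWindowVAD()
decisions = [vad.push(c) for c in ["what is", "the weather", "in Paris today?"]]
# decisions -> [False, False, True]: the endpoint fires only when the
# accumulated text reads as a finished question.
```

Because the detector consumes text (or speech features) independently of the dialogue model, it can be swapped out or retrained without touching the main system, which is the decoupling the summary emphasizes.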
📝 Abstract
Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix-VAD achieves excellent and competitive performance. Furthermore, this design enables the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.