Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction

📅 2025-09-24
🤖 AI Summary
Existing spoken dialogue systems lack plug-and-play, full-duplex semantic endpoint detection modules, resulting in unnatural interaction and high response latency. To address this, we propose the first semantic Voice Activity Detection (VAD) model tailored for streaming speech interaction. It is the first to incorporate Large Language Models (LLMs) into endpoint detection, leveraging their semantic understanding to identify semantically complete utterance boundaries. We adopt a sliding-window training strategy to enable low-latency streaming inference, and fully decouple the module from the main dialogue model, so that it can be optimized and deployed independently. Experiments demonstrate that our approach significantly outperforms conventional acoustic VADs in both semantically complete and incomplete speech scenarios, achieving a better trade-off between accuracy and real-time performance.

📝 Abstract
Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix-VAD achieves excellent and competitive performance. Furthermore, this design enables the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.
Problem

Research questions and friction points this paper is trying to address.

Lack of plug-and-play full-duplex semantic endpoint detection for speech dialogue models
Need for streaming semantic endpoint detection that supports real-time inference
Requirement for independent optimization of endpoint detection from dialogue models
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based model for semantic endpoint detection
Sliding window training for streaming inference
Independent full-duplex prediction module optimization
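
The sliding-window idea behind streaming inference can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: `sliding_windows`, `detect_endpoint`, the `window`, `hop`, and `threshold` parameters, and the one-float-per-frame representation are all hypothetical. In particular, `toy_predict` is a placeholder scoring function standing in for the LLM-based model, whose architecture and inputs are not described in this summary.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Optional


def sliding_windows(frames: Iterable[float], window: int, hop: int) -> Iterator[List[float]]:
    """Yield the most recent `window` frames each time `hop` new frames arrive."""
    buf: Deque[float] = deque(maxlen=window)  # old frames fall out automatically
    since_last = 0
    for frame in frames:
        buf.append(frame)
        since_last += 1
        if len(buf) == window and since_last >= hop:
            since_last = 0
            yield list(buf)


def detect_endpoint(
    frames: Iterable[float],
    predict: Callable[[List[float]], float],
    window: int = 4,
    hop: int = 2,
    threshold: float = 0.5,
) -> Optional[int]:
    """Return the index of the first window scored at or above `threshold`, else None."""
    for i, chunk in enumerate(sliding_windows(frames, window, hop)):
        if predict(chunk) >= threshold:
            return i
    return None


def toy_predict(chunk: List[float]) -> float:
    """Placeholder score (NOT the paper's model): fraction of near-zero frames."""
    return sum(1 for x in chunk if x < 0.1) / len(chunk)


if __name__ == "__main__":
    stream = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.0, 0.0, 0.0, 0.0]
    print(detect_endpoint(stream, toy_predict, threshold=0.75))
```

Because the detector only depends on a `predict` callable, the scoring model can be swapped or retrained without touching the dialogue model, which is the kind of decoupling the summary describes.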
Authors

Weijie Wu, Roblox (Computer Networks)
Wenhao Guan, Xiamen University (speech)
Kaidi Wang, School of Informatics, Xiamen University, China
Peijie Chen, School of Informatics, Xiamen University, China
Zhuanling Zha, DiDi Global Inc., Beijing, China
Junbo Li, University of Texas at Austin (agentic reasoning, LLM, reinforcement learning)
Jun Fang, DiDi Global Inc., Beijing, China
Lin Li, School of Electronic Science and Engineering, Xiamen University, China
Qingyang Hong, School of Informatics, Xiamen University, China