FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in full-duplex dialogue systems of simultaneously achieving low latency and robust semantic understanding in turn-taking detection. Conventional voice activity detection lacks semantic awareness, while ASR-dependent approaches suffer from high latency and poor robustness under overlapping speech and noisy conditions. To overcome these limitations, we propose FastTurn, a novel framework that unifies streaming CTC decoding with acoustic features to enable early turn-taking decisions from partial speech observations while preserving critical semantic cues. We introduce a new test set incorporating realistic conversational dynamics—including speech overlap, backchannel utterances, and environmental noise—and demonstrate that FastTurn significantly reduces interruption latency without compromising accuracy, outperforming existing baselines under challenging conditions.
📝 Abstract
Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches either rely on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, limiting evaluation and deployment. To mitigate these problems, we propose FastTurn, a unified framework for low-latency and robust turn detection. To reduce latency while maintaining performance, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set based on real human dialogue, capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.
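The abstract describes fusing a streaming CTC cue (whether the recognizer is still emitting tokens) with acoustic evidence to commit to an end-of-turn decision before the utterance is fully observed. The paper does not specify its decision rule, so the sketch below is only an illustrative assumption: it treats a sustained run of frames with high CTC blank probability (a semantic "nothing new being said" cue) and low frame energy (an acoustic silence cue) as an early turn-end trigger, which also lets short backchannel-length pauses pass without firing. The function name and all thresholds are hypothetical.

```python
# Hedged sketch (NOT the paper's implementation): fusing a streaming CTC
# cue with a simple acoustic cue for early end-of-turn decisions.
# All names and threshold values below are illustrative assumptions.

def detect_turn_end(ctc_blank_probs, frame_energies,
                    blank_thresh=0.9, energy_thresh=0.1,
                    min_trailing_frames=5):
    """Return the earliest frame index at which a turn end is declared,
    or None if the turn never ends within the observed frames.

    A turn end is declared once `min_trailing_frames` consecutive frames
    show BOTH a high CTC blank probability (no new tokens emitted ->
    semantic cue) AND low frame energy (silence -> acoustic cue).
    """
    run = 0
    for i, (blank_p, energy) in enumerate(zip(ctc_blank_probs, frame_energies)):
        if blank_p >= blank_thresh and energy <= energy_thresh:
            run += 1
            if run >= min_trailing_frames:
                return i  # early decision from a partial observation
        else:
            run = 0  # speech or token emission resumed: reset the run
    return None


# Usage: 10 frames of speech followed by sustained silence triggers a
# decision; a 3-frame pause (backchannel-length) does not.
ended = detect_turn_end([0.1] * 10 + [0.95] * 6, [0.5] * 10 + [0.01] * 6)
paused = detect_turn_end([0.1] * 5 + [0.95] * 3 + [0.1] * 5,
                         [0.5] * 5 + [0.01] * 3 + [0.5] * 5)
```

Requiring both cues is what distinguishes this from plain VAD: silence alone (e.g., a mid-sentence pause where the CTC stream is still mid-token) does not fire, and a confident CTC blank during overlapping background noise does not fire either.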
Problem

Research questions and friction points this paper is trying to address.

turn detection
full-duplex dialogue
low-latency
robustness
semantic understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

FastTurn
streaming CTC decoding
acoustic-semantic fusion
low-latency turn detection
full-duplex dialogue
Chengyou Wang
Audio, Speech and Language Processing Group (ASLP@NPU)
Hongfei Xue
Northwestern Polytechnical University
Speech recognition, self-supervised learning
Chunjiang He
Audio, Speech and Language Processing Group (ASLP@NPU)
Jingbin Hu
Audio, Speech and Language Processing Group (ASLP@NPU)
Shuiyuan Wang
Audio, Speech and Language Processing Group (ASLP@NPU)
Bo Wu
Department of Land Surveying & Geo-Informatics, The Hong Kong Polytechnic University
Photogrammetry and Robotic Vision, Planetary Remote Sensing and Mapping, 3D GIS
Yuyu Ji
Shengwang
Jimeng Zheng
Shengwang
Ruofei Chen
Shengwang
Zhou Zhu
QualiaLabs
Lei Xie
Northwestern Polytechnical University
Speech processing, speech recognition, speech synthesis, multimedia, artificial intelligence