🤖 AI Summary
This work addresses the challenge of simultaneously achieving low latency and robust semantic understanding in turn-taking detection for full-duplex dialogue systems. Conventional voice activity detection lacks semantic awareness, while ASR-dependent approaches suffer from high latency and degrade under overlapping speech and noisy conditions. To overcome these limitations, we propose FastTurn, a framework that unifies streaming CTC decoding with acoustic features, enabling early turn-taking decisions from partial speech observations while preserving critical semantic cues. We also introduce a test set that incorporates realistic conversational dynamics, including speech overlap, backchannel utterances, and environmental noise, and demonstrate that FastTurn significantly reduces interruption latency without compromising accuracy, outperforming existing baselines under challenging conditions.
📝 Abstract
Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches either rely on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, limiting both evaluation and deployment. To address these limitations, we propose **FastTurn**, a unified framework for low-latency and robust turn detection. To reduce latency while maintaining performance, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set based on real human dialogue, capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show that FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.
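The core idea of combining semantic evidence from streaming CTC partial hypotheses with acoustic evidence can be sketched as a simple score-fusion loop. The sketch below is an illustrative assumption, not the paper's actual implementation: the function names, the convex-combination fusion, the `alpha` weight, and the `threshold` are all hypothetical, standing in for however FastTurn actually combines the two signals.

```python
# Hedged sketch of early turn-taking decision via score fusion.
# A semantic end-of-turn score (e.g. derived from a streaming CTC partial
# hypothesis) is fused with an acoustic score (e.g. trailing-silence energy),
# and the system commits to a decision as soon as the fused score clears a
# threshold -- i.e. from a partial observation, before the utterance ends.
# All names, weights, and thresholds here are illustrative assumptions.

def fused_end_of_turn_score(semantic_score: float,
                            acoustic_score: float,
                            alpha: float = 0.6) -> float:
    """Convex combination of semantic and acoustic end-of-turn evidence."""
    return alpha * semantic_score + (1.0 - alpha) * acoustic_score

def early_turn_decision(stream, threshold: float = 0.8):
    """Return the index of the first frame whose fused score reaches the
    threshold, or None if the turn never appears to complete."""
    for i, (sem, ac) in enumerate(stream):
        if fused_end_of_turn_score(sem, ac) >= threshold:
            return i  # commit early, before the full utterance is observed
    return None

if __name__ == "__main__":
    # Simulated per-frame scores: semantic confidence rises as the CTC
    # partial hypothesis becomes a complete utterance; acoustic score rises
    # as trailing silence accumulates.
    frames = [(0.1, 0.0), (0.3, 0.1), (0.6, 0.4), (0.9, 0.8), (0.95, 0.9)]
    print(early_turn_decision(frames))  # -> 3 (decision at the 4th frame)
```

The latency benefit in this toy setting comes from committing at the first qualifying frame rather than waiting for end-of-speech silence alone, while the semantic term guards against firing on mid-utterance pauses.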