LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses turn-taking control in full-duplex spoken dialogue systems. It proposes a lightweight semantic voice activity detection (semantic VAD) method that distinguishes intentional from unintentional interruptions in real time, detects user query completion, and handles pauses and disfluencies robustly. The approach has three key elements: (1) a four-class control-token prediction scheme that uses a fine-tuned 0.5B-parameter LLM for semantic-level VAD; (2) streaming speech understanding over short time windows, coupled with token-based control decisions; and (3) explicit decoupling of the dialogue manager from the generation engine, so the dialogue manager can be optimized independently without retraining the engine. Experimental results are reported to show significant reductions in turn-taking latency and computational overhead while maintaining high detection accuracy, making interaction more natural and the system more scalable. The modular architecture is presented as an efficient foundation for next-generation full-duplex spoken dialogue systems.

📝 Abstract
Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.
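
The control flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the four token names, the function names, and the rule-based `toy_predict` stand-in for the fine-tuned LLM are all assumptions made for illustration, and text chunks stand in for short speech intervals. The sketch also does not model the system finishing its own response. What it shows is the division of labor: the dialogue manager is queried on every chunk, while the core dialogue engine is called only when the manager decides the system should start speaking.

```python
from enum import Enum
from typing import Callable, List


class Control(Enum):
    # Hypothetical names: the source only states that there are four
    # control tokens covering turn-switching and turn-keeping.
    CONTINUE_LISTENING = "continue_listening"  # keep turn with user (pause, hesitation)
    START_SPEAKING = "start_speaking"          # switch: user query is complete
    CONTINUE_SPEAKING = "continue_speaking"    # keep turn with system (unintentional barge-in)
    START_LISTENING = "start_listening"        # switch: intentional barge-in


def run_dialogue_manager(
    chunks: List[str],
    predict: Callable[[List[str], bool], Control],
    generate: Callable[[List[str]], str],
) -> List[str]:
    """Query the DM on every chunk; call the CDE only on START_SPEAKING."""
    speaking = False
    context: List[str] = []
    events: List[str] = []
    for chunk in chunks:
        context.append(chunk)
        token = predict(context, speaking)
        if token is Control.START_SPEAKING and not speaking:
            events.append("speak:" + generate(context))  # the only CDE call site
            speaking = True
        elif token is Control.START_LISTENING and speaking:
            events.append("stop")  # system yields the turn
            speaking = False
        # CONTINUE_* tokens leave the current state unchanged.
    return events


def toy_predict(context: List[str], speaking: bool) -> Control:
    """Rule-based stand-in for the fine-tuned LLM (assumption, not the paper's model)."""
    latest = context[-1].strip().lower()
    if speaking:
        if latest in {"", "uh-huh", "mm", "okay"}:  # treated as unintentional barge-in
            return Control.CONTINUE_SPEAKING
        return Control.START_LISTENING
    if latest.endswith(("?", ".")):  # crude proxy for query completion
        return Control.START_SPEAKING
    return Control.CONTINUE_LISTENING


def toy_generate(context: List[str]) -> str:
    """Stand-in for the core dialogue engine."""
    return f"reply after {len(context)} chunks"


if __name__ == "__main__":
    chunks = ["what is the", "weather today?", "uh-huh", "wait stop", "I meant tomorrow."]
    print(run_dialogue_manager(chunks, toy_predict, toy_generate))
```

In this toy run the engine is invoked twice for five input chunks, which is the kind of saving the abstract attributes to activating the CDE only for response generation. Because `predict` and `generate` are passed in separately, either can be swapped without touching the other, mirroring the decoupling the paper describes.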
Problem

Research questions and friction points this paper is trying to address.

Enhance full-duplex communication in dialogue systems
Manage real-time turn-taking with semantic VAD
Optimize dialogue management for computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based semantic VAD
Real-time turn management
Lightweight dialogue optimization
Hao Zhang
Tencent AI Lab, Bellevue, USA
Weiwei Li
Beijing University of Chemical Technology
Organic Photovoltaics, Organic Solar Cells, Conjugated Polymers
Rilin Chen
Tencent AI Lab, Beijing, China
Vinay Kothapally
Tencent AI Labs
Microphone Array Processing, Machine Learning, Speech Enhancement, Distant Speech Recognition
Meng Yu
Tencent AI Lab, Bellevue, USA
Dong Yu
Tencent AI Lab, Bellevue, USA