🤖 AI Summary
This work addresses the challenge of maintaining semantic coherence in full-duplex spoken dialogue, where a large language model must generate a response while still processing streaming user speech and is prone to contextual interference. The study treats user-stream routing as a central design dimension and builds a unified full-duplex spoken dialogue framework to compare two strategies under a shared training pipeline: channel fusion, which injects the user stream directly into the LLM input, and cross-attention routing, which keeps the user stream as external memory accessed through cross-attention adapters. Experiments show that channel fusion achieves stronger semantic grounding and better spoken question-answering performance but is more vulnerable to context corruption during user interruptions. Cross-attention routing underperforms on question answering but better preserves the generation context and is more robust to overlapping speech. The results reveal a tradeoff between semantic integration and context robustness in real-time conversational systems.
📝 Abstract
Full-duplex spoken dialogue requires a model to keep listening while generating its own spoken response. This is challenging for large language models (LLMs), which are designed to extend a single coherent sequence and do not naturally support user input arriving during generation. We argue that how the user stream is routed into the LLM is therefore a key architectural question for full-duplex modeling. To study this question, we extend a text-only LLM into a unified full-duplex spoken dialogue system and compare two routing strategies under a shared training pipeline: (i) channel fusion, which injects the user stream directly into the LLM input, and (ii) cross-attention routing, which keeps the user stream as external memory accessed through cross-attention adapters. Experiments on spoken question answering and full-duplex interaction benchmarks reveal a clear tradeoff. Channel fusion yields stronger semantic grounding and consistently better question-answering performance. However, under semantically overlapping conditions such as user interruptions, it is more vulnerable to context corruption: if the model fails to stop in time, the overlapping user stream can interfere with ongoing generation and lead to semantically incoherent continuations. Cross-attention routing underperforms on question answering, but better preserves the LLM generation context and is more robust to this failure mode. These results establish user-stream routing as a central design axis in full-duplex spoken dialogue and offer practical guidance on the tradeoff between semantic integration and context robustness. We provide a demo page for qualitative inspection.
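The contrast between the two routing strategies can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the element-wise sum used for fusion, the single-head unprojected attention, the scalar `gate`, and the causal slicing of the user memory are all assumptions made for illustration. It only shows where the user stream enters in each case: fusion changes the LLM input itself, while cross-attention adds a gated residual on top of an otherwise unchanged stream.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def channel_fusion(model_emb, user_emb):
    """Channel fusion (assumed here to be an element-wise sum per frame).

    The user frame is mixed into the LLM input at every step, so the
    sequence the LLM extends is itself altered by the user stream.
    """
    return [[m + u for m, u in zip(m_frame, u_frame)]
            for m_frame, u_frame in zip(model_emb, user_emb)]


def cross_attention_route(hidden, user_memory, gate=0.0):
    """Cross-attention routing (hypothetical single-head adapter, no projections).

    The user stream stays in an external memory. At step t the hidden state
    attends over user frames 0..t (causal, an assumption for streaming) and
    the result is added as a residual scaled by `gate`.
    """
    dim = len(hidden[0])
    routed = []
    for t, h in enumerate(hidden):
        memory = user_memory[: t + 1]  # only user frames heard so far
        scores = [sum(a * b for a, b in zip(h, u)) / math.sqrt(dim)
                  for u in memory]
        weights = softmax(scores)
        context = [sum(w * u[i] for w, u in zip(weights, memory))
                   for i in range(dim)]
        routed.append([hi + gate * ci for hi, ci in zip(h, context)])
    return routed


# Two time steps of 2-dimensional toy embeddings (made-up values).
model_emb = [[1.0, 0.0], [0.0, 1.0]]
user_emb = [[0.5, 0.5], [2.0, -1.0]]

fused = channel_fusion(model_emb, user_emb)
untouched = cross_attention_route(model_emb, user_emb, gate=0.0)
routed = cross_attention_route(model_emb, user_emb, gate=1.0)
```

With `gate=0.0` the routed stream equals the original model stream, which mirrors the intuition that an adapter-based route can leave the LLM generation context intact, whereas the fused input always carries the overlapping user signal.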