DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Native multimodal large language models (MLLMs) suffer from catastrophic forgetting and performance degradation due to the scarcity of paired speech-text data. Method: We propose a modality-aware Mixture-of-Experts (MoE) architecture with an adaptive expert separation mechanism: unimodal specialization pretraining followed by multimodal collaborative fine-tuning decouples modality-specific learning, eliminates the need for an external speech decoder, and enables end-to-end joint speech-text generation. Contribution/Results: The approach preserves paralinguistic features (e.g., emotion, prosody) while keeping end-to-end dialogue latency under 0.5 seconds. Relative to the base LLM, task performance degrades by only 5.5%, substantially outperforming comparable native MLLMs (average degradation >20%). To our knowledge, this is the first native MLLM to simultaneously achieve high-fidelity speech generation, ultra-low-latency interaction, and strong linguistic capability.

📝 Abstract
Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than through a separate speech decoder. This integration also yields lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is far scarcer than the vast text corpora used to pretrain text-only LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, significantly lower than the average drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at https://github.com/talkking/DeepTalk.
Problem

Research questions and friction points this paper is trying to address.

Addresses catastrophic forgetting in native MLLMs due to insufficient speech-text data.
Proposes adaptive modality expert learning to preserve LLM performance.
Reduces response latency to enable seamless, intelligent speech interaction.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive MoE for modality expert learning
Specialized single-modality training for each expert first
Joint multimodal collaborative training afterward
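The expert-separation idea above can be illustrated with a toy modality-masked MoE router: the router only assigns each token to the expert pool of its own modality, so speech and text learning stay decoupled. This is a minimal NumPy sketch under assumed names and dimensions (the expert partition, top-1 routing, and linear experts are hypothetical illustration, not DeepTalk's actual implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS = 8, 4
# Hypothetical partition: experts 0-1 serve text tokens, 2-3 serve speech tokens.
MODALITY_EXPERTS = {"text": [0, 1], "speech": [2, 3]}

# Each expert is a simple linear map; weights are random placeholders.
expert_weights = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_EXPERTS)]
router_weights = rng.standard_normal((D, N_EXPERTS)) * 0.1

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(tokens, modality):
    """Top-1 routing restricted to the experts of the token's modality."""
    logits = tokens @ router_weights                 # (T, N_EXPERTS)
    mask = np.full(N_EXPERTS, -np.inf)
    mask[MODALITY_EXPERTS[modality]] = 0.0           # forbid other-modality experts
    probs = softmax(logits + mask)                   # zero prob outside modality
    choice = probs.argmax(axis=-1)                   # chosen expert per token
    out = np.stack([tokens[i] @ expert_weights[choice[i]] * probs[i, choice[i]]
                    for i in range(len(tokens))])
    return out, choice

tokens = rng.standard_normal((5, D))
out, chosen = moe_forward(tokens, "speech")
```

With the mask in place, every speech token lands on a speech expert, so gradient updates for one modality never touch the other modality's expert parameters, which is the mechanism the paper credits with avoiding catastrophic forgetting.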