DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Native multimodal large language models (MLLMs) suffer from catastrophic forgetting and performance degradation due to the scarcity of paired speech-text data. Method: We propose a modality-aware Mixture-of-Experts (MoE) architecture with an adaptive expert separation mechanism: unimodal specialization pretraining followed by multimodal collaborative fine-tuning decouples modality-specific learning, eliminates the need for an external speech decoder, and enables end-to-end joint speech-text generation. Contribution/Results: The approach preserves paralinguistic features (e.g., emotion, prosody) while keeping end-to-end dialogue latency under 0.5 seconds. Relative to the base LLM, task performance degrades by only 5.5%, substantially outperforming comparable native MLLMs (average degradation >20%). To our knowledge, this is the first native MLLM to simultaneously achieve high-fidelity speech generation, ultra-low-latency interaction, and strong linguistic capability.

📝 Abstract
Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than through a separate speech decoder. This integration also yields lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is far scarcer than the vast text corpora used to pretrain text-only LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, significantly lower than the average drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at https://github.com/talkking/DeepTalk.
Problem

Research questions and friction points this paper is trying to address.

Addresses catastrophic forgetting in native MLLMs due to insufficient speech-text data.
Proposes adaptive modality expert learning to preserve LLM performance.
Reduces response latency to enable seamless, intelligent speech interaction.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive MoE for modality expert learning
Specialized single-modality training for each expert first
Joint multimodal collaborative training afterward
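The expert-separation idea above can be illustrated with a toy modality-masked MoE router: the router only assigns each token to the expert pool of its own modality, so speech and text learning stay decoupled. This is a minimal NumPy sketch under assumed names and dimensions (the expert partition, top-1 routing, and linear experts are hypothetical illustration, not DeepTalk's actual implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS = 8, 4
# Hypothetical partition: experts 0-1 serve text tokens, 2-3 serve speech tokens.
MODALITY_EXPERTS = {"text": [0, 1], "speech": [2, 3]}

# Each expert is a simple linear map; weights are random placeholders.
expert_weights = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_EXPERTS)]
router_weights = rng.standard_normal((D, N_EXPERTS)) * 0.1

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(tokens, modality):
    """Top-1 routing restricted to the experts of the token's modality."""
    logits = tokens @ router_weights                 # (T, N_EXPERTS)
    mask = np.full(N_EXPERTS, -np.inf)
    mask[MODALITY_EXPERTS[modality]] = 0.0           # forbid other-modality experts
    probs = softmax(logits + mask)                   # zero prob outside modality
    choice = probs.argmax(axis=-1)                   # chosen expert per token
    out = np.stack([tokens[i] @ expert_weights[choice[i]] * probs[i, choice[i]]
                    for i in range(len(tokens))])
    return out, choice

tokens = rng.standard_normal((5, D))
out, chosen = moe_forward(tokens, "speech")
```

With the mask in place, every speech token lands on a speech expert, so gradient updates for one modality never touch the other modality's expert parameters, which is the mechanism the paper credits with avoiding catastrophic forgetting.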