MambaVoiceCloning: Efficient and Expressive Text-to-Speech via State-Space Modeling and Diffusion Control

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational complexity, substantial memory footprint, and deployment challenges of diffusion-based text-to-speech synthesis that stem from its reliance on attention mechanisms and recurrent structures. The authors propose the first fully state-space-model (SSM)-based conditioning pathway, eliminating attention and RNN modules entirely. Their approach employs a gated bidirectional Mamba text encoder, a temporally bidirectional Mamba alignment module, and an Expressive Mamba module modulated via AdaLN (adaptive layer normalization), trained under the supervision of a lightweight alignment teacher that is discarded after training. This design achieves high-quality speech generation while significantly improving memory efficiency, inference stability, and streaming capability. Experiments demonstrate consistent gains over strong baselines such as StyleTTS2 and VITS across multiple datasets, with improvements in MOS/CMOS scores, F0 RMSE, MCD, and WER. The encoder is reduced to 21M parameters, and throughput increases by 1.6×.
📝 Abstract
MambaVoiceCloning (MVC) asks whether the conditioning path of diffusion-based TTS can be made fully SSM-only at inference, removing all attention and explicit RNN-style recurrence layers across text, rhythm, and prosody, while preserving or improving quality under controlled conditions. MVC combines a gated bidirectional Mamba text encoder, a Temporal Bi-Mamba supervised by a lightweight alignment teacher discarded after training, and an Expressive Mamba with AdaLN modulation, yielding linear-time O(T) conditioning with bounded activation memory and practical finite look-ahead streaming. Unlike prior Mamba-TTS systems that remain hybrid at inference, MVC removes attention-based duration and style modules under a fixed StyleTTS2 mel-diffusion-vocoder backbone. Trained on LJSpeech/LibriTTS and evaluated on VCTK, CSS10 (ES/DE/FR), and long-form Gutenberg passages, MVC achieves modest but statistically reliable gains over StyleTTS2, VITS, and Mamba-attention hybrids in MOS/CMOS, F0 RMSE, MCD, and WER, while reducing encoder parameters to 21M and improving throughput by 1.6x. Diffusion remains the dominant latency source, but SSM-only conditioning improves memory footprint, stability, and deployability.
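The AdaLN modulation mentioned in the abstract conditions the SSM pathway on a style signal by scaling and shifting layer-normalized hidden states. A minimal NumPy sketch of this mechanism follows; all names, shapes, and the zero-initialized-style projection are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension. No learned affine here:
    # in AdaLN the scale/shift come from the conditioning signal instead.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln(h, cond, W_gamma, W_beta):
    """Adaptive LayerNorm: modulate normalized hidden states with a
    scale (gamma) and shift (beta) projected from a conditioning vector
    (e.g. a style or speaker embedding). Hypothetical helper."""
    gamma = cond @ W_gamma  # (d_cond,) @ (d_cond, d_model) -> (d_model,)
    beta = cond @ W_beta
    # (1 + gamma) keeps the identity mapping when the projections are zero.
    return (1.0 + gamma) * layer_norm(h) + beta

# Toy usage with illustrative dimensions.
rng = np.random.default_rng(0)
T, d_model, d_cond = 5, 8, 4
h = rng.standard_normal((T, d_model))        # sequence of hidden states
cond = rng.standard_normal(d_cond)           # style/speaker conditioning
W_gamma = rng.standard_normal((d_cond, d_model)) * 0.1
W_beta = rng.standard_normal((d_cond, d_model)) * 0.1
out = adaln(h, cond, W_gamma, W_beta)
print(out.shape)  # (5, 8)
```

The `1 + gamma` parameterization is a common design choice: with zero-initialized projections the block starts as a plain LayerNorm, so conditioning is introduced gradually during training.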
Problem

Research questions and friction points this paper is trying to address.

text-to-speech
state-space modeling
diffusion model
attention-free
voice cloning
Innovation

Methods, ideas, or system contributions that make the work stand out.

State-Space Model
Diffusion-based TTS
Attention-free Architecture
Linear-time Inference
Expressive Voice Cloning