Exploring State-Space-Model based Language Model in Music Generation

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates State Space Models (SSMs) for text-to-music generation, presenting what the authors describe as the first adaptation of the Mamba architecture as a music decoder. Audio is encoded into discrete tokens via Residual Vector Quantization (RVQ), and the authors empirically find that a single-layer codebook suffices to capture salient musical semantics. SiMBA, originally designed as a Mamba-based encoder, is then adapted into an autoregressive sequence decoder. Compared with a Transformer baseline under limited-resource training, the SSM decoder converges faster and generates audio closer to the ground-truth recordings. The study demonstrates the viability of SSMs for modeling musical structure under computational constraints, pointing toward lightweight, high-fidelity music generation.

📝 Abstract
The recent surge in State Space Models (SSMs), particularly the emergence of Mamba, has established them as strong alternatives or complementary modules to Transformers across diverse domains. In this work, we aim to explore the potential of Mamba-based architectures for text-to-music generation. We adopt discrete tokens of Residual Vector Quantization (RVQ) as the modeling representation and empirically find that a single-layer codebook can capture semantic information in music. Motivated by this observation, we focus on modeling a single-codebook representation and adapt SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling. We compare its performance against a standard Transformer-based decoder. Our results suggest that, under limited-resource settings, SiMBA achieves much faster convergence and generates outputs closer to the ground truth. This demonstrates the promise of SSMs for efficient and expressive text-to-music generation. Audio examples are available on GitHub.
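To make the representation choice concrete: a single-layer RVQ codebook reduces to plain vector quantization, where each audio frame embedding is replaced by the index of its nearest codebook entry. A minimal NumPy sketch (hypothetical sizes and a random codebook for illustration; not the paper's actual codec):

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE, DIM = 256, 4                       # hypothetical sizes
codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))  # learned in a real codec

def quantize(frames):
    """Map each frame embedding to its nearest codebook index (single-layer RVQ)."""
    # squared distances between every frame and every codebook entry
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                      # one discrete token per frame

frames = rng.normal(size=(10, DIM))               # 10 encoder frames
tokens = quantize(frames)
print(tokens.shape)  # (10,)
```

A deeper RVQ stack would repeat this step on the residual `frames - codebook[tokens]`; the paper's observation is that for musical semantics, the first layer alone carries most of the useful information.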
Problem

Research questions and friction points this paper is trying to address.

Explore Mamba-based models for text-to-music generation
Compare SiMBA with Transformer decoders in music synthesis
Achieve efficient music generation under limited resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba-based architecture for music generation
Single-codebook RVQ representation modeling
SiMBA adapted as efficient sequence decoder
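The decoding side can be illustrated with a toy diagonal state-space recurrence over single-codebook token embeddings. All sizes and parameters below are hypothetical placeholders (the paper's actual SiMBA/Mamba configuration is not given here); the sketch only shows the key efficiency property of an SSM decoder — a fixed-size hidden state per step, instead of a Transformer's growing attention cache:

```python
import numpy as np

VOCAB, D_MODEL, D_STATE = 1024, 16, 8
rng = np.random.default_rng(0)

embed = rng.normal(size=(VOCAB, D_MODEL)) * 0.1   # token embedding table
A = np.exp(-rng.uniform(0.1, 1.0, size=D_STATE))  # stable decay (0 < A < 1)
B = rng.normal(size=(D_STATE, D_MODEL)) * 0.1     # input projection
C = rng.normal(size=(D_MODEL, D_STATE)) * 0.1     # output projection

def ssm_step(h, x):
    """One recurrent step: constant memory regardless of sequence length."""
    h = A * h + B @ x          # elementwise state decay plus input drive
    return h, C @ h            # emit output features for this position

def decode(tokens):
    h = np.zeros(D_STATE)
    outs = []
    for t in tokens:           # autoregressive scan over codebook indices
        h, y = ssm_step(h, embed[t])
        outs.append(y)
    return np.stack(outs)

ys = decode([3, 17, 512, 9])
print(ys.shape)  # (4, 16)
```

A real Mamba block additionally makes `A`, `B`, `C` input-dependent (the selective mechanism) and interleaves gating and convolution, but the constant-state recurrence above is what gives SSM decoders their fast inference relative to Transformers.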
Wei-Jaw Lee
Graduate Institute of Communication Engineering, National Taiwan University, Taiwan
Fang-Chih Hsieh
Graduate Institute of Communication Engineering, National Taiwan University, Taiwan
Xuanjun Chen
National Taiwan University
Speech Processing · Machine Learning · Generative AI · Deepfakes
Fang-Duo Tsai
National Taiwan University
Music AI
Yi-Hsuan Yang
National Taiwan University
Music Information Retrieval · Music Generation · Music Processing · Music AI · Affective Computing