🤖 AI Summary
This work investigates the application of State Space Models (SSMs) to text-to-music generation, introducing the first adaptation of the Mamba architecture as an efficient music decoder. To enable discrete modeling, audio is encoded via Residual Vector Quantization (RVQ), and the authors empirically find that a single codebook layer suffices to capture salient musical semantics. The SiMBA encoder is then refactored into an autoregressive, SSM-based sequence decoder. Compared to a Transformer baseline, the SSM decoder converges faster under low-resource training conditions and runs inference more efficiently, while generating audio with higher fidelity and musicality, closer to real-world recordings. The study demonstrates the viability of SSMs for modeling long-range musical structure under computational constraints, pointing toward lightweight, high-fidelity music generation.
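To make the RVQ step concrete, here is a minimal NumPy sketch of residual vector quantization: each stage quantizes the residual left by the previous stage, so the first codebook alone already gives a coarse "semantic" approximation, which is the property the paper exploits by modeling only a single codebook. All sizes and the random codebooks below are illustrative assumptions, not the paper's actual codec.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Residual Vector Quantization: stage k quantizes the residual left
    by stages 1..k-1. Returns one code index per stage."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                      # cb: (num_codes, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - cb[idx]         # pass residual to next stage
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected code vectors across stages."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Toy setup (hypothetical sizes; real neural codecs use e.g. 1024 codes/stage).
dim, num_codes, num_stages = 8, 16, 4
codebooks = []
for _ in range(num_stages):
    cb = rng.normal(size=(num_codes, dim))
    cb[0] = 0.0          # a zero code guarantees a later stage never hurts
    codebooks.append(cb)

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)

# Error with all stages vs. with only the first codebook: extra stages
# only refine the coarse first-stage approximation.
err_full = np.linalg.norm(x - x_hat)
err_one = np.linalg.norm(x - codebooks[0][codes[0]])
```

In trained codecs the codebooks are learned, so the first stage captures most of the perceptually salient content; the sketch only shows the encode/decode mechanics.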
📝 Abstract
The recent surge in State Space Models (SSMs), particularly the emergence of Mamba, has established them as strong alternatives or complementary modules to Transformers across diverse domains. In this work, we explore the potential of Mamba-based architectures for text-to-music generation. We adopt the discrete tokens of Residual Vector Quantization (RVQ) as the modeling representation and empirically find that a single-layer codebook can capture the semantic information in music. Motivated by this observation, we focus on modeling a single-codebook representation and adapt SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling. We compare its performance against a standard Transformer-based decoder. Our results suggest that, under limited-resource settings, SiMBA achieves much faster convergence and generates outputs closer to the ground truth. This demonstrates the promise of SSMs for efficient and expressive text-to-music generation. Audio examples are available on GitHub.
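The efficiency claim rests on the linear recurrence at the core of S4/Mamba-style layers: the decoder carries a fixed-size state forward one step at a time, rather than attending over a growing KV cache as a Transformer does. The sketch below shows that recurrence for a diagonal, time-invariant system with zero-order-hold discretization; it is a simplification, since Mamba additionally makes `delta`, `B`, and `C` input-dependent (the "selective" mechanism), and all shapes here are illustrative.

```python
import numpy as np

def ssm_scan(x, A, B, C, delta):
    """Run a discretized diagonal state-space recurrence over a sequence.

    Zero-order-hold discretization of h' = A h + B x:
        A_bar = exp(delta * A),  B_bar = delta * B
    Each step costs O(state_dim), independent of sequence length.
    """
    A_bar = np.exp(delta * A)            # diagonal A: elementwise exp
    B_bar = delta * B
    h = np.zeros(A.shape[0])             # fixed-size carried state
    ys = np.empty(x.shape[0])
    for t in range(x.shape[0]):
        h = A_bar * h + B_bar * x[t]     # constant-memory state update
        ys[t] = C @ h                    # scalar readout per step
    return ys

# One-dimensional toy system: a decaying state driven by an impulse.
A = np.array([-1.0])                     # stable (negative) dynamics
B = np.array([1.0])
C = np.array([1.0])
delta = 0.1
x = np.array([1.0, 0.0, 0.0])            # impulse input
ys = ssm_scan(x, A, B, C, delta)
```

The constant per-step cost and memory of this scan is what makes SSM decoders attractive for long audio-token sequences under tight compute budgets.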