🤖 AI Summary
This work investigates the application of State Space Models (SSMs) to text-to-music generation, introducing the first adaptation of the Mamba architecture as an efficient music decoder. To enable discrete modeling, audio is encoded via Residual Vector Quantization (RVQ), and the authors empirically find that a single codebook layer suffices to capture salient musical semantics. The SiMBA encoder is then refactored into an autoregressive, SSM-based sequence decoder. Compared to a Transformer baseline, the SSM decoder converges faster under low-resource training conditions and runs inference more efficiently, while generating audio with higher fidelity and musicality, closer to real-world recordings. The study demonstrates the viability of SSMs for modeling long-range musical structure under computational constraints, pointing toward lightweight, high-fidelity music generation.
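To make the RVQ step concrete, here is a minimal NumPy sketch of residual vector quantization: each stage quantizes the residual left by the previous stage, so the first codebook alone already gives a coarse "semantic" approximation, which is the property the paper exploits by modeling only a single codebook. All sizes and the random codebooks below are illustrative assumptions, not the paper's actual codec.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Residual Vector Quantization: stage k quantizes the residual left
    by stages 1..k-1. Returns one code index per stage."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                      # cb: (num_codes, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - cb[idx]         # pass residual to next stage
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected code vectors across stages."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Toy setup (hypothetical sizes; real neural codecs use e.g. 1024 codes/stage).
dim, num_codes, num_stages = 8, 16, 4
codebooks = []
for _ in range(num_stages):
    cb = rng.normal(size=(num_codes, dim))
    cb[0] = 0.0          # a zero code guarantees a later stage never hurts
    codebooks.append(cb)

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)

# Error with all stages vs. with only the first codebook: extra stages
# only refine the coarse first-stage approximation.
err_full = np.linalg.norm(x - x_hat)
err_one = np.linalg.norm(x - codebooks[0][codes[0]])
```

In trained codecs the codebooks are learned, so the first stage captures most of the perceptually salient content; the sketch only shows the encode/decode mechanics.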
📝 Abstract
The recent surge in State Space Models (SSMs), particularly the emergence of Mamba, has established them as strong alternatives or complementary modules to Transformers across diverse domains. In this work, we explore the potential of Mamba-based architectures for text-to-music generation. We adopt the discrete tokens of Residual Vector Quantization (RVQ) as the modeling representation and empirically find that a single-layer codebook can capture the semantic information in music. Motivated by this observation, we focus on modeling a single-codebook representation and adapt SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling. We compare its performance against a standard Transformer-based decoder. Our results suggest that, under limited-resource settings, SiMBA achieves much faster convergence and generates outputs closer to the ground truth. This demonstrates the promise of SSMs for efficient and expressive text-to-music generation. Audio examples are available on GitHub.
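The efficiency claim rests on the linear recurrence at the core of S4/Mamba-style layers: the decoder carries a fixed-size state forward one step at a time, rather than attending over a growing KV cache as a Transformer does. The sketch below shows that recurrence for a diagonal, time-invariant system with zero-order-hold discretization; it is a simplification, since Mamba additionally makes `delta`, `B`, and `C` input-dependent (the "selective" mechanism), and all shapes here are illustrative.

```python
import numpy as np

def ssm_scan(x, A, B, C, delta):
    """Run a discretized diagonal state-space recurrence over a sequence.

    Zero-order-hold discretization of h' = A h + B x:
        A_bar = exp(delta * A),  B_bar = delta * B
    Each step costs O(state_dim), independent of sequence length.
    """
    A_bar = np.exp(delta * A)            # diagonal A: elementwise exp
    B_bar = delta * B
    h = np.zeros(A.shape[0])             # fixed-size carried state
    ys = np.empty(x.shape[0])
    for t in range(x.shape[0]):
        h = A_bar * h + B_bar * x[t]     # constant-memory state update
        ys[t] = C @ h                    # scalar readout per step
    return ys

# One-dimensional toy system: a decaying state driven by an impulse.
A = np.array([-1.0])                     # stable (negative) dynamics
B = np.array([1.0])
C = np.array([1.0])
delta = 0.1
x = np.array([1.0, 0.0, 0.0])            # impulse input
ys = ssm_scan(x, A, B, C, delta)
```

The constant per-step cost and memory of this scan is what makes SSM decoders attractive for long audio-token sequences under tight compute budgets.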