Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes BandTok, a generation-oriented two-dimensional tokenizer for Mel-spectrograms. It targets a limitation of existing high-fidelity music codecs built on residual vector quantization: once the residual codebooks are flattened into a single sequence, the hierarchy imposes strong sequential dependencies that complicate autoregressive modeling and can amplify error accumulation. BandTok instead assigns discrete tokens to individual Mel-frequency bands in each time frame using a single shared codebook, yielding a physically interpretable time-frequency token grid whose tokens are more independent of one another. Coupled with an autoregressive language model that uses 2D Rotary Position Embedding (2D RoPE), the method treats music generation as image-like time-frequency modeling. In experiments, BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. Source code and generation demos are publicly available.
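The band-wise tokenization described above can be illustrated with a minimal sketch. It is not the released BandTok implementation: it assumes that each band is a contiguous group of Mel bins and that raw bins are quantized directly by nearest-neighbor lookup, whereas the actual model presumably quantizes learned encoder features. The names `band_tokenize`, `band_detokenize`, and `band_size` are hypothetical.

```python
import numpy as np

def band_tokenize(mel, codebook, band_size):
    """Map a Mel-spectrogram to a (frames, bands) grid of codebook indices.

    mel:       (n_frames, n_mels) array
    codebook:  (n_codes, band_size) array, shared by every band (assumption:
               codes live directly in Mel-bin space, with no learned encoder)
    band_size: Mel bins per band (hypothetical parameter; must divide n_mels)
    """
    mel = np.asarray(mel, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    n_frames, n_mels = mel.shape
    assert n_mels % band_size == 0, "band_size must divide n_mels"
    n_bands = n_mels // band_size
    # Split every frame into contiguous frequency bands.
    bands = mel.reshape(n_frames, n_bands, band_size)
    # Squared Euclidean distance from each band vector to each code.
    diff = bands[:, :, None, :] - codebook[None, None, :, :]
    dists = (diff ** 2).sum(axis=-1)          # (n_frames, n_bands, n_codes)
    return dists.argmin(axis=-1)              # (n_frames, n_bands)

def band_detokenize(tokens, codebook):
    """Rebuild an approximate Mel-spectrogram from the token grid."""
    codebook = np.asarray(codebook, dtype=float)
    n_frames, _ = tokens.shape
    return codebook[tokens].reshape(n_frames, -1)

# Toy usage: 2 frames, 4 Mel bins, bands of 2 bins, 3 shared codes.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
mel = np.array([[0.0, 0.0, 2.0, 2.0],
                [1.0, 1.0, 0.1, 0.0]])
tokens = band_tokenize(mel, codebook, band_size=2)
recon = band_detokenize(tokens, codebook)
```

Because every cell of `tokens` indexes the same codebook and corresponds to one frame and one frequency band, the grid can be read like an image, which is the property the summary calls physically interpretable.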

📝 Abstract

Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.
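To make the role of 2D RoPE concrete, the sketch below shows one common way to build a two-axis rotary embedding: half of the feature dimension is rotated by the time-frame index and the other half by the frequency-band index. The abstract does not specify the exact formulation, so this split, the base of 10000, the row-major flattening order, and the function names `rope_1d`, `rope_2d`, and `grid_positions` are all assumptions.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to pos.

    x:   (n, d) array with d even
    pos: (n,) positions along one axis
    """
    x = np.asarray(x, dtype=float)
    pos = np.asarray(pos, dtype=float)
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) rotation rates
    ang = pos[:, None] * inv_freq[None, :]         # (n, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def rope_2d(x, t_pos, f_pos):
    """Assumed 2D variant: first half of the features encodes the time
    frame, second half encodes the frequency band."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    assert d % 4 == 0, "feature size must be divisible by 4"
    half = d // 2
    return np.concatenate(
        [rope_1d(x[:, :half], t_pos), rope_1d(x[:, half:], f_pos)], axis=-1
    )

def grid_positions(n_frames, n_bands):
    """Time and band index of each token after row-major flattening
    (frame by frame, low band to high band; an assumed ordering)."""
    t_pos = np.repeat(np.arange(n_frames), n_bands)
    f_pos = np.tile(np.arange(n_bands), n_frames)
    return t_pos, f_pos

# Toy usage: queries for a 3-frame, 4-band token grid with 8 features each.
rng = np.random.default_rng(0)
q = rng.normal(size=(12, 8))
t_pos, f_pos = grid_positions(3, 4)
q_rot = rope_2d(q, t_pos, f_pos)
```

With this construction, the attention score between two tokens depends only on their offsets in time and in band, so the flattened sequence keeps the two-dimensional layout of the token grid.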

Problem

Research questions and friction points this paper is trying to address.

autoregressive music generation

audio tokenizer

residual quantization

sequence flattening

error accumulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

BandTok

2D tokenizer

time-frequency image