AccompGen: Hierarchical Autoregressive Vocal Accompaniment Generation with Dual-Rate Codec Tokenization

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of automatically generating harmonically coordinated instrumental accompaniments from isolated vocal inputs. The authors propose a three-level hierarchical autoregressive architecture that sequentially models semantic representations (via HuBERT at 50 Hz) followed by coarse- and fine-grained acoustic features (via EnCodec at 75 Hz). Precise temporal alignment between vocals and accompaniment is achieved through a dual-rate codec framework. The method integrates classifier-free guidance, interleaved multi-codebook prediction, and modern Transformer components—including QK-norm, GEGLU activation, RMSNorm, and T5-style relative positional bias—to enhance training stability and generalization. Evaluated on MUSDB18, the model achieves a Fréchet Audio Distance (FAD) of 2.08, outperforming retrieval-based baselines and matching state-of-the-art performance with fewer parameters.
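The dual-rate alignment described above comes down to simple arithmetic: both token streams index the same timeline, but at 50 and 75 frames per second. The sketch below is a hypothetical illustration, not the authors' code; the function names are assumptions, and only the two rates come from the summary. It shows how many tokens each stream produces for a clip and which 75 Hz frames overlap a given 50 Hz frame.

```python
import math
from fractions import Fraction

HUBERT_HZ = 50   # semantic token rate stated in the summary
ENCODEC_HZ = 75  # acoustic token rate stated in the summary


def tokens_for_duration(seconds, rate_hz):
    """Number of whole tokens covering `seconds` of audio at `rate_hz`."""
    return math.floor(Fraction(seconds) * rate_hz)


def acoustic_span_for_semantic(i):
    """Range of 75 Hz frame indices that overlap 50 Hz frame `i` in time.

    Hypothetical helper: exact rational arithmetic avoids float rounding.
    """
    start = Fraction(i, HUBERT_HZ)        # start time of semantic frame i
    end = Fraction(i + 1, HUBERT_HZ)      # end time of semantic frame i
    first = math.floor(start * ENCODEC_HZ)
    last = math.ceil(end * ENCODEC_HZ)
    return range(first, last)


if __name__ == "__main__":
    print(tokens_for_duration(10, HUBERT_HZ), tokens_for_duration(10, ENCODEC_HZ))
    print([list(acoustic_span_for_semantic(i)) for i in range(3)])
```

Because 75/50 = 3/2, every two semantic frames span exactly three acoustic frames, so the overlap pattern repeats every 40 ms.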

📝 Abstract

We present AccompGen, a system that generates instrumental music audio to accompany input vocals. Given isolated singing voice, AccompGen produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. We propose three key innovations over prior work: (1) a dual-rate codec tokenization scheme using HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, enabling time-aligned yet rate-independent modeling; (2) a three-stage hierarchical autoregressive architecture (semantic to coarse acoustic to fine acoustic) with interleaved multi-codebook prediction and classifier-free guidance; and (3) modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias for improved training stability and sequence generalization.
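Interleaved multi-codebook prediction can be pictured as flattening several parallel codebook streams into one token sequence that an autoregressive model reads left to right. The sketch below assumes a frame-major order (all codebooks for frame 0, then frame 1, and so on) and a per-codebook vocabulary offset; the abstract does not specify the exact pattern, so both choices, the function names, and the `codebook_size=1024` default are assumptions for illustration.

```python
def interleave(codes, codebook_size=1024):
    """Flatten K codebook streams of length T into one list of length K*T.

    `codes[k][t]` is the index chosen from codebook k at frame t.
    Each codebook is shifted by k * codebook_size so that tokens from
    different codebooks never collide in the shared vocabulary.
    """
    num_frames = len(codes[0])
    if any(len(book) != num_frames for book in codes):
        raise ValueError("all codebooks must have the same number of frames")
    flat = []
    for t in range(num_frames):
        for k, book in enumerate(codes):
            flat.append(book[t] + k * codebook_size)
    return flat


def deinterleave(flat, num_codebooks, codebook_size=1024):
    """Invert `interleave`, recovering one list per codebook."""
    if len(flat) % num_codebooks != 0:
        raise ValueError("sequence length must be a multiple of num_codebooks")
    codes = [[] for _ in range(num_codebooks)]
    for pos, token in enumerate(flat):
        k = pos % num_codebooks
        codes[k].append(token - k * codebook_size)
    return codes


if __name__ == "__main__":
    coarse = [[5, 6, 7], [1, 2, 3]]  # two codebooks, three frames
    flat = interleave(coarse)
    print(flat)
    print(deinterleave(flat, num_codebooks=2))
```

With this layout, the model predicts the second codebook's token for a frame only after seeing the first codebook's token for the same frame.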

Problem

Research questions and friction points this paper is trying to address.

music accompaniment generation

vocal-to-instrumental synthesis

audio generation

singing voice accompaniment

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-rate tokenization

hierarchical autoregressive modeling

multi-codebook prediction

classifier-free guidance
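For readers unfamiliar with the last item, classifier-free guidance in an autoregressive token model is commonly applied to logits at sampling time: the model is run once with the conditioning (here, the vocal tokens) and once without, and the two outputs are extrapolated. The sketch below shows that standard formulation in plain Python; it is not taken from the paper, and the function names and guidance scale are illustrative assumptions.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Combine conditional and unconditional logits.

    guided = uncond + scale * (cond - uncond)
    scale = 1.0 reproduces the conditional logits; larger values push the
    distribution further toward what the conditioning favors.
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    cond = [2.0, 0.0]     # logits given the vocal conditioning
    uncond = [1.0, 0.0]   # logits with the conditioning dropped
    guided = cfg_logits(cond, uncond, scale=3.0)
    print(guided, softmax(guided))
```

Training for this scheme typically drops the conditioning at random for a fraction of examples so that one model can produce both sets of logits.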