Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation

📅 2026-05-03

🤖 AI Summary

This work revisits a common design in high-fidelity music generation, where structure and fine detail are modeled in heterogeneous representation spaces, which adds system complexity and complicates optimization. The authors propose a two-stage coarse-to-fine framework within a single deep acoustic-token space: a backbone model first generates coarse tokens for the full track, and a super-resolution model then fills in finer tokens in the same space, refining layer by layer while processing all time steps in parallel. A key finding is that lyric-vocal alignment can emerge in a purely acoustic-token language model, without a separate semantic token stage. Initializing the super-resolution model from the trained backbone significantly improves convergence and final quality. The system builds on a 64-layer residual vector quantization (RVQ) representation, hybrid-attention training (causal attention for the alignment objective, full attention for layer-wise refinement), and a fixed 62-step inference process. The results suggest that high-quality music generation can be pursued within a unified acoustic-token hierarchy rather than across separate representation spaces.
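The two attention patterns named in the summary (causal for alignment, full for refinement) differ only in which positions each token may attend to. The sketch below builds both masks for a toy sequence. The function name `attention_mask` and the boolean-matrix layout are illustrative assumptions; the summary does not say how the two masks are scheduled or combined during training, so that part is left out.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len boolean matrix (hypothetical helper).

    mask[i][j] is True when query position i may attend to key position j.
    causal=True  -> position i sees only positions 0..i (lower triangle).
    causal=False -> every position sees every position (full attention).
    """
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Print both masks for a 4-token toy sequence; 1 = may attend, 0 = blocked.
    for name, causal in (("causal", True), ("full", False)):
        print(name)
        for row in attention_mask(4, causal):
            print(" ".join("1" if allowed else "0" for allowed in row))
```

A causal mask matches left-to-right generation, where later tokens are not yet available, while a full mask lets a refinement pass use context from the whole track at once.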

📝 Abstract

A common design pattern in high-quality music generation is to handle structure and fidelity in different representation spaces: a generator first models high-level structure, followed by diffusion-based or neural decoding stages that reconstruct fine details. In this work, we explore an alternative view: both may be progressively modeled within a single deep acoustic-token hierarchy. To study this, we build a 64-layer residual vector quantization (RVQ) acoustic representation and propose a two-stage coarse-to-fine generation framework. A backbone model first generates coarse acoustic tokens for the full track, and a super-resolution model then completes finer tokens within the same acoustic token space. The super-resolution stage works at full-track scale and refines tokens layer by layer while running in parallel over time, leading to a fixed 62-step inference process. To jointly improve lyric alignment and fine-detail reconstruction, we further introduce hybrid-attention training: the alignment objective uses causal attention, while layer-wise refinement uses full attention. A key finding is that text-vocal alignment can emerge within pure acoustic-token language modeling, without requiring a separate semantic token stage. Moreover, initializing the super-resolution model from the trained backbone significantly improves convergence and final quality. Taken together, our results suggest that high-quality music generation can be effectively pursued without separating structure and fidelity into heterogeneous representation spaces. Instead, both can be progressively modeled within a unified acoustic-token hierarchy, pointing toward a simpler and more unified path to high-quality music generation.
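To make the coarse-to-fine idea concrete, here is a minimal sketch of generic residual vector quantization, in which each layer quantizes whatever the previous layers left unexplained. It is not the paper's codec: the codebooks are random toy data, the sizes (16 entries, 4 dimensions) are arbitrary, and a zero entry is added to every codebook purely so that adding a layer can never increase the error in this toy setup. All function names are hypothetical.

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """One token per layer; each layer quantizes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the chosen entries of the first len(tokens) layers."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_codebooks(num_layers, size, dim, seed=0):
    """Random toy codebooks (assumption: real codebooks are learned)."""
    rng = random.Random(seed)
    books = []
    for layer in range(num_layers):
        scale = 0.8 ** layer  # deeper layers hold smaller corrections
        book = [[0.0] * dim]  # zero entry: a layer may leave the residual as is
        book += [[rng.gauss(0, scale) for _ in range(dim)]
                 for _ in range(size - 1)]
        books.append(book)
    return books


if __name__ == "__main__":
    codebooks = make_codebooks(num_layers=64, size=16, dim=4)
    vec = [0.3, -1.2, 0.7, 0.05]
    tokens = rvq_encode(vec, codebooks)
    for k in (1, 2, 8, 64):  # decode from only the first k layers
        rec = rvq_decode(tokens[:k], codebooks)
        err = sum((a - b) ** 2 for a, b in zip(vec, rec))
        print(f"layers={k:2d}  squared error={err:.6f}")
```

Decoding from a prefix of the token stack gives a coarse reconstruction, and each further layer can only tighten it. That property is what allows one model to produce the first layers and another to complete the rest within the same token space.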

Problem

Research questions and friction points this paper is trying to address.

music generation

acoustic token

structure and fidelity

unified representation

high-fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

acoustic token hierarchy

residual vector quantization

coarse-to-fine generation