LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of simultaneously achieving acoustic fidelity, global coherence, and dynamic arrangement in vocal-to-accompaniment generation. It proposes LaDA-Band, which introduces a discrete masked diffusion model to this task. The method combines discrete audio codec tokens, bidirectional non-autoregressive modeling, and a dual-track prefix-conditioning architecture. An auxiliary replaced-token detection objective and a two-stage curriculum training strategy help balance long-range structural modeling with fine-grained detail preservation. On both academic and real-world benchmarks, LaDA-Band improves acoustic quality, global coherence, and dynamic arrangement over existing baselines, and it maintains strong performance even without auxiliary reference audio.
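As a rough illustration of what a replaced-token detection objective trains on, the sketch below builds one training example: some tokens are swapped for different ones, and a binary label marks each swapped position. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `make_rtd_example`, the vocabulary size of 1024, the 25% replacement rate, and the uniform random replacement are all hypothetical choices made for illustration.

```python
import random


def make_rtd_example(tokens, vocab_size, replace_prob, rng):
    """Corrupt a token sequence and return (corrupted, labels).

    labels[i] is 1 where the token was swapped for a different one, else 0.
    Uniform random replacement is an assumption made for this sketch.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Sample from the other vocab_size - 1 ids so the replacement
            # is guaranteed to differ from the original token.
            new_tok = rng.randrange(vocab_size - 1)
            if new_tok >= tok:
                new_tok += 1
            corrupted.append(new_tok)
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels


rng = random.Random(0)
# Hypothetical accompaniment codec tokens (ids in [0, 1024)).
accompaniment_tokens = [rng.randrange(1024) for _ in range(16)]
corrupted, labels = make_rtd_example(accompaniment_tokens, 1024, 0.25, rng)
```

A detection head would then be trained with a per-position binary loss against `labels`, in addition to the main denoising loss.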

📝 Abstract

Vocal-to-accompaniment (V2A) generation, which aims to transform a raw vocal recording into a fully arranged accompaniment, requires jointly addressing an accompaniment trilemma: preserving acoustic authenticity, maintaining global coherence with the vocal track, and producing dynamic orchestration across a full song. Existing open-source approaches typically compromise among these goals. Continuous-latent generation models can capture long musical spans but often struggle to preserve fine-grained acoustic detail. In contrast, discrete autoregressive models retain local fidelity but suffer from unidirectional generation and error accumulation in extended contexts. We present LaDA-Band, an end-to-end framework that introduces Discrete Masked Diffusion to the V2A task. Our approach formulates V2A generation as a global, non-autoregressive denoising process that combines the representational advantages of discrete audio codec tokens with full-sequence bidirectional context modeling. This design improves long-range structural consistency and temporal synchronization while preserving crisp acoustic details. Built on this formulation, LaDA-Band further introduces a dual-track prefix-conditioning architecture, an auxiliary replaced-token detection objective for weakly anchored accompaniment regions, and a two-stage progressive curriculum to scale Discrete Masked Diffusion to full-song vocal-to-accompaniment generation. Extensive experiments on both academic and real-world benchmarks show that LaDA-Band consistently improves acoustic authenticity, global coherence, and dynamic orchestration over existing baselines, while maintaining strong performance even without auxiliary reference audio. Code and audio samples are available at https://github.com/Duoluoluos/TME-LaDA-Band .
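To make the masked-diffusion formulation more concrete, the following sketch shows the two halves of such a process on discrete tokens: a forward step that hides a random subset of accompaniment tokens, and an iterative reverse loop that starts fully masked and reveals the most confident predictions at each step. It is a minimal sketch under stated assumptions rather than the authors' method: `MASK`, `mask_tokens`, `toy_predictor`, and `generate` are hypothetical names, the predictor is a random stand-in for the bidirectional network (it receives the vocal prefix but ignores it), and the cosine reveal schedule is a common choice for masked token samplers that the abstract does not specify.

```python
import math
import random

MASK = -1  # hypothetical mask id, outside the codec vocabulary


def mask_tokens(tokens, mask_ratio, rng):
    """Forward (noising) step: hide a random subset of tokens."""
    n_mask = round(mask_ratio * len(tokens))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in hidden else t for i, t in enumerate(tokens)]


def toy_predictor(vocal_prefix, accomp, vocab_size, rng):
    """Stand-in for the network: maps each masked position to a
    (predicted token, confidence) pair. A real model would attend to the
    vocal prefix and to every accompaniment position in both directions."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(accomp)
        if t == MASK
    }


def generate(vocal_prefix, length, vocab_size, steps, rng):
    """Reverse (denoising) loop: start fully masked, reveal tokens in parallel."""
    accomp = [MASK] * length
    for step in range(steps):
        preds = toy_predictor(vocal_prefix, accomp, vocab_size, rng)
        if not preds:
            break
        # Cosine schedule (assumed): fraction left masked after this step.
        remain = math.cos(math.pi / 2 * (step + 1) / steps)
        n_reveal = max(1, len(preds) - int(remain * length))
        by_confidence = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in by_confidence[:n_reveal]:
            accomp[i] = preds[i][0]
    return accomp


noisy = mask_tokens(list(range(10)), 0.5, random.Random(1))
out = generate([1, 2, 3], 12, 1024, 4, random.Random(0))
```

Because every masked position is predicted from the whole sequence at each step, early commitments can be conditioned on later context, which is the property the abstract contrasts with left-to-right autoregressive decoding.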

Problem

Research questions and friction points this paper is trying to address.

Vocal-to-accompaniment

acoustic authenticity

global coherence

dynamic orchestration

music generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete Masked Diffusion

Vocal-to-Accompaniment Generation

Non-autoregressive Modeling