🤖 AI Summary
Existing music AI systems for vocal-conditioned accompaniment generation suffer from excessive parameter counts, slow inference, and poor suitability for real-time deployment. To address these challenges, this paper proposes a lightweight latent diffusion model. First, a pretrained VQ-VAE is employed to construct a compact latent space. Second, a time-aware soft-alignment attention mechanism is introduced to adaptively integrate local and global temporal dependencies while dynamically adjusting to diffusion step progression. Third, an ultra-lightweight architecture enables end-to-end efficient modeling. The resulting model contains only 15 million parameters—220× fewer than OpenAI’s Jukebox—and achieves 52× faster inference. It supports real-time accompaniment generation on consumer-grade GPUs without compromising audio fidelity; objective and subjective evaluations confirm superior sound quality over prior approaches.
📝 Abstract
We present a lightweight latent diffusion model for vocal-conditioned musical accompaniment generation that addresses critical limitations in existing music AI systems. Our approach introduces a novel soft alignment attention mechanism that adaptively combines local and global temporal dependencies based on diffusion timesteps, enabling efficient capture of multi- scale musical structure. Operating in the compressed latent space of a pre-trained variational autoencoder, the model achieves a 220 times parameter reduction compared to state-of-the-art systems while delivering 52 times faster inference. Experimental evaluation demonstrates competitive performance with only 15M parame- ters, outperforming OpenAI Jukebox in production quality and content unity while maintaining reasonable musical coherence. The ultra-lightweight architecture enables real-time deployment on consumer hardware, making AI-assisted music creation ac- cessible for interactive applications and resource-constrained environments.