🤖 AI Summary
To address the challenge of deploying computationally expensive music source separation models on resource-constrained devices, this paper proposes a lightweight band-split U-Net architecture. The method introduces band-split convolution and a dual-path feature fusion mechanism within the U-Net encoder-decoder framework, enabling frequency-band-adaptive modeling and efficient cross-path information interaction. Despite its significantly reduced parameter count (13× fewer than state-of-the-art large models), the proposed architecture achieves competitive source separation performance, matching top-performing models in Signal-to-Distortion Ratio (SDR) on MUSDB-HQ. Moreover, it demonstrates strong generalization and scalability on the extended MoisesDB dataset. This work substantially raises the performance ceiling for lightweight source separation models and provides a viable solution for real-time, on-device music separation.
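To make the band-split idea concrete, the following is a minimal NumPy sketch of the general technique: the frequency axis of a spectrogram is partitioned into sub-bands, and each band is embedded independently before a downstream network models interactions across bands and time. The band edges, hidden size, and per-band linear transform here are illustrative assumptions, not the exact design of the proposed model.

```python
import numpy as np

rng = np.random.default_rng(0)

F, T = 1024, 128             # frequency bins, time frames
spec = rng.standard_normal((F, T))  # stand-in for a spectrogram

# Illustrative band edges (assumed): finer resolution at low frequencies,
# where most musical energy lies.
band_edges = [0, 64, 128, 256, 512, 1024]
hidden = 32                  # per-band feature dimension (assumed)

bands = list(zip(band_edges[:-1], band_edges[1:]))

# One projection per band, mapping that band's bins to a shared hidden size.
weights = [rng.standard_normal((hidden, hi - lo)) / np.sqrt(hi - lo)
           for lo, hi in bands]

# Band-split step: embed each band independently, then stack the results so
# later layers can fuse information across bands and across time.
features = np.stack([w @ spec[lo:hi]
                     for w, (lo, hi) in zip(weights, bands)])

print(features.shape)        # (num_bands, hidden, T)
```

Because every band is reduced to the same small hidden size regardless of how many bins it spans, the per-band projections keep the parameter count low while still letting the model treat low and high frequencies with different resolution.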
📝 Abstract
In recent years, significant advances have been made in music source separation, with architectures based on dual-path modeling, band-split modules, and transformer layers achieving strong results. However, these models often contain a large number of parameters, making both training and practical deployment challenging on devices with limited computational resources. While some lightweight models have been introduced, they generally underperform their larger counterparts. In this paper, we take inspiration from these recent advances to improve a lightweight model. We demonstrate that, with careful design, a lightweight model can achieve signal-to-distortion ratios (SDRs) comparable to those of models with up to 13 times more parameters. Our proposed model, Moises-Light, achieves competitive results in separating four musical stems on the MUSDB-HQ benchmark dataset, and it also demonstrates competitive scalability when MoisesDB is used as additional training data.