🤖 AI Summary
Existing music source restoration (MSR) benchmarks suffer from two fundamental limitations: synthetic mixtures exhibit unrealistic distortions, while real-world recordings lack clean reference signals, preventing faithful evaluation of restoration fidelity. To address this, we introduce MSRBench, the first benchmark explicitly designed for MSR evaluation. It comprises high-fidelity mixtures produced by professional mixing engineers from original dry recordings of eight instrument classes, augmented with twelve realistic degradations, including analog hardware distortion, acoustic reverberation, and lossy compression. Crucially, MSRBench provides authentic dry–mix pairs with ground-truth clean references, bridging the gap between synthetic benchmarks and real-world MSR applications. Baseline evaluations with U-Net and BSRNN reveal severe performance bottlenecks: SI-SNR reaches only −37.8 dB and −23.4 dB, respectively, while FAD and CLAP scores cluster narrowly around 0.7–0.8. These results underscore the benchmark's role in advancing MSR research and evaluation.
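The summary above describes augmenting clean mixtures with realistic degradations. As a loose illustration only (the actual MSRBench pipeline uses twelve professionally designed degradations whose exact parameters are not specified here), two of the named categories can be sketched with basic DSP: analog-style distortion via a soft clipper, and acoustic reverberation via convolution with a synthetic decaying-noise impulse response. The function name and all parameters below are hypothetical.

```python
import numpy as np

def apply_degradations(mix, sr=44100, rng=None):
    """Illustrative degradation chain (NOT the MSRBench pipeline):
    tanh soft clipping (analog-style distortion) followed by
    convolution with a synthetic decaying-noise impulse response
    (simplified room reverberation)."""
    rng = rng or np.random.default_rng(0)
    # Analog-style distortion: soft clipper with a fixed drive of 3.0
    degraded = np.tanh(3.0 * mix)
    # Reverberation: exponentially decaying white-noise IR (~0.3 s)
    n_ir = int(0.3 * sr)
    ir = rng.standard_normal(n_ir) * np.exp(-np.linspace(0.0, 8.0, n_ir))
    degraded = np.convolve(degraded, ir, mode="full")[: len(mix)]
    # Peak-normalize so the degraded mix stays in [-1, 1]
    return degraded / (np.max(np.abs(degraded)) + 1e-8)
```

Lossy-codec artifacts, the third category named in the summary, would require an actual encoder round-trip and are omitted from this sketch.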
📝 Abstract
Music Source Restoration (MSR) extends source separation to realistic settings where signals undergo production effects (equalization, compression, reverb) and real-world degradations, with the goal of recovering the original unprocessed sources. Existing benchmarks cannot measure restoration fidelity: synthetic datasets use unprocessed stems but unrealistic mixtures, while real production datasets provide only already-processed stems without clean references. We present MSRBench, the first benchmark explicitly designed for MSR evaluation. MSRBench contains raw stem–mixture pairs across eight instrument classes, where mixtures are produced by professional mixing engineers. These raw–processed pairs enable direct evaluation of both separation accuracy and restoration fidelity. Beyond controlled studio conditions, the mixtures are augmented with twelve real-world degradations spanning analog artifacts, acoustic environments, and lossy codecs. Baseline experiments with U-Net and BSRNN achieve SI-SNR of −37.8 dB and −23.4 dB, respectively, with perceptual quality metrics (FAD, CLAP) around 0.7–0.8, demonstrating substantial room for improvement and the need for restoration-specific architectures.
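The baseline figures above are reported in SI-SNR (scale-invariant signal-to-noise ratio), which scores an estimate against the clean reference after projecting out any overall gain difference. A minimal sketch of the standard definition, assuming 1-D NumPy arrays (this is the common formulation of the metric, not code from the paper):

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB (higher is better).

    Projects the (zero-mean) estimate onto the target; the residual
    after projection is treated as noise, so rescaling the estimate
    does not change the score."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    # Scaled reference: projection of the estimate onto the target
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(
        (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    )
```

Negative values such as the −37.8 dB and −23.4 dB reported here mean the residual energy exceeds the energy of the matched target component, i.e. the baselines fail to recover the dry sources.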