🤖 AI Summary
Audio source separation, spanning speech, music, and general audio, faces a fundamental trade-off between separation quality and computational cost. This paper introduces the first discriminative separation framework that is scalable at both training and inference time, enabling flexible speed-accuracy trade-offs by dynamically adjusting inference depth, *without* retraining. The key contributions are: (1) an early-split, multi-loss supervision architecture that provides fine-grained gradient guidance; and (2) a parameter-sharing backbone with a dynamic inference-repetition mechanism, ensuring parameter efficiency and smooth performance scaling with depth. Evaluated on standard speech separation benchmarks, the method achieves state-of-the-art (SOTA) performance with fewer parameters, reduces training cost by 32%, and cuts real-time inference latency by 47%, improving deployment adaptability and energy efficiency.
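The dynamic inference-repetition idea can be sketched as follows. This is a minimal, hypothetical toy in plain Python (the class, weights, and nonlinearity are all made up for illustration, not the paper's actual network): a single weight-shared block is applied repeatedly, and the number of repetitions is a free knob chosen at inference time.

```python
import math

class SharedDepthSeparator:
    """Toy weight-shared refinement backbone (illustrative only)."""

    def __init__(self, dim, n_sources):
        # One shared set of "weights" reused at every depth,
        # plus one projection per output source (early-split heads).
        self.shared = [0.1 * (i + 1) / dim for i in range(dim)]
        self.heads = [[(s + 1) * 0.5] * dim for s in range(n_sources)]

    def refine(self, h):
        # One pass through the shared block: residual + tanh nonlinearity.
        return [x + math.tanh(w * x) for x, w in zip(h, self.shared)]

    def separate(self, mixture, depth):
        # `depth` trades speed for quality at inference, same parameters.
        h = list(mixture)
        for _ in range(depth):
            h = self.refine(h)
        return [[g * x for g, x in zip(head, h)] for head in self.heads]

model = SharedDepthSeparator(dim=4, n_sources=2)
mix = [0.5, -1.0, 2.0, 0.3]
fast = model.separate(mix, depth=2)  # low-latency setting
slow = model.separate(mix, depth=8)  # higher-quality setting, same weights
```

Because every repetition reuses the same parameters, switching between the `fast` and `slow` settings requires no retraining and no extra model storage.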
📝 Abstract
Source separation is a fundamental task in speech, music, and audio processing, and it also provides cleaner and larger data for training generative models. However, improving separation performance in practice often depends on increasingly large networks, inflating training and deployment costs. Motivated by recent advances in inference-time scaling for generative modeling, we propose Training-Time and Inference-Time Scalable Discriminative Source Separation (TISDiSS), a unified framework that integrates early-split multi-loss supervision, shared-parameter design, and dynamic inference repetitions. TISDiSS enables flexible speed-performance trade-offs by adjusting inference depth without retraining additional models. We further provide systematic analyses of architectural and training choices and show that training with more inference repetitions improves shallow-inference performance, benefiting low-latency applications. Experiments on standard speech separation benchmarks demonstrate state-of-the-art performance with a reduced parameter count, establishing TISDiSS as a scalable and practical framework for adaptive source separation.
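The early-split multi-loss supervision described above can be sketched in the same toy style (again plain Python with made-up weights and a simple L2 loss standing in for SI-SDR-style objectives; this is not the paper's actual training code): separation heads, and a loss term, are attached after every repetition of the shared block, so shallow inference depths also receive direct supervision.

```python
import math

SHARED_W = [0.05, 0.10, 0.15, 0.20]  # toy shared-block weights
HEADS = [[0.5] * 4, [1.0] * 4]       # one toy projection per source

def refine(h):
    # One pass of the weight-shared block (residual + tanh).
    return [x + math.tanh(w * x) for x, w in zip(h, SHARED_W)]

def split_heads(h):
    # Early-split heads: every repetition can emit per-source estimates.
    return [[g * x for g, x in zip(head, h)] for head in HEADS]

def multi_depth_loss(mixture, targets, train_depth):
    # Accumulate an L2 loss after every repetition so that shallow
    # depths also receive direct gradient guidance during training.
    h, total = list(mixture), 0.0
    for _ in range(train_depth):
        h = refine(h)
        total += sum(
            sum((e - t) ** 2 for e, t in zip(est, tgt))
            for est, tgt in zip(split_heads(h), targets)
        )
    return total / train_depth

mix = [0.5, -1.0, 2.0, 0.3]
targets = [[0.2, -0.4, 0.8, 0.1], [0.3, -0.6, 1.2, 0.2]]
loss_shallow = multi_depth_loss(mix, targets, train_depth=2)
loss_deep = multi_depth_loss(mix, targets, train_depth=6)
```

Training with a larger `train_depth` exposes the shared block to more supervised repetitions, which is consistent with the abstract's observation that training with more inference repetitions improves shallow-inference performance.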