TISDiSS: A Training-Time and Inference-Time Scalable Framework for Discriminative Source Separation

📅 2025-09-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Audio source separation, spanning speech, music, and general audio, faces a fundamental trade-off between performance and computational cost. This paper introduces the first discriminative separation framework that is scalable *both* during training and inference, enabling flexible speed-accuracy trade-offs by dynamically adjusting inference depth *without* retraining. Our key contributions are: (1) an early-split multi-loss supervision architecture that provides fine-grained gradient guidance; and (2) a parameter-sharing backbone with a dynamic inference repetition mechanism, ensuring parameter efficiency and performance continuity under depth scaling. Evaluated on standard speech separation benchmarks, our method achieves state-of-the-art (SOTA) performance with fewer parameters, reduces training cost by 32%, and cuts real-time inference latency by 47%. These gains significantly enhance deployment adaptability and energy efficiency.

📝 Abstract
Source separation is a fundamental task in speech, music, and audio processing, and it also provides cleaner and larger data for training generative models. However, improving separation performance in practice often depends on increasingly large networks, inflating training and deployment costs. Motivated by recent advances in inference-time scaling for generative modeling, we propose Training-Time and Inference-Time Scalable Discriminative Source Separation (TISDiSS), a unified framework that integrates early-split multi-loss supervision, shared-parameter design, and dynamic inference repetitions. TISDiSS enables flexible speed-performance trade-offs by adjusting inference depth without retraining additional models. We further provide systematic analyses of architectural and training choices and show that training with more inference repetitions improves shallow-inference performance, benefiting low-latency applications. Experiments on standard speech separation benchmarks demonstrate state-of-the-art performance with a reduced parameter count, establishing TISDiSS as a scalable and practical framework for adaptive source separation.
Problem

Research questions and friction points this paper is trying to address.

Scalable framework for discriminative source separation
Reducing training and deployment costs
Flexible speed-performance trade-offs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early-split multi-loss supervision
Shared-parameter design architecture
Dynamic inference repetition scaling
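
The interplay of these three contributions can be illustrated with a toy sketch: a single shared block is applied repeatedly, an estimate is emitted after every repetition (so a loss can supervise each depth during training), and the number of repetitions is chosen freely at call time. Note that `SharedDepthSeparator`, its dimensions, and the tanh residual update are illustrative assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedDepthSeparator:
    """Toy sketch: one weight matrix is reused at every refinement step,
    so inference depth can be changed per call without retraining.
    (Hypothetical illustration, not the TISDiSS architecture itself.)"""

    def __init__(self, dim=8):
        # A single shared block, reused across all repetitions.
        self.w_shared = rng.standard_normal((dim, dim)) * 0.1
        # A lightweight output head producing a source estimate.
        self.w_out = rng.standard_normal((dim, dim)) * 0.1

    def forward(self, mixture, repetitions=4):
        h = mixture
        estimates = []  # one estimate per repetition -> a loss at every depth
        for _ in range(repetitions):
            h = h + np.tanh(h @ self.w_shared)  # residual reuse of shared block
            estimates.append(h @ self.w_out)
        return estimates

model = SharedDepthSeparator()
x = rng.standard_normal((2, 8))            # a batch of 2 "mixture" feature vectors
shallow = model.forward(x, repetitions=2)  # low-latency setting
deep = model.forward(x, repetitions=8)     # higher-accuracy setting, same weights
```

The same weights serve both calls; only the loop count changes, which is what allows the speed-performance trade-off to be tuned at deployment time rather than at training time.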
Yongsheng Feng
Department of Music AI and Music IT, Central Conservatory of Music, Beijing, China
Yuetonghui Xu
Department of Music AI and Music IT, Central Conservatory of Music, Beijing, China
Jiehui Luo
University of Notre Dame
Hongjia Liu
Aalto University
Xiaobing Li
University of Wisconsin-Madison; SUNY College of Optometry
Feng Yu
University of Exeter
Wei Li
School of Computer Science and Technology, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China