AI Summary
To address the high parameter count and computational cost of speech separation models in low-latency applications, this paper proposes TIGER, a lightweight and efficient architecture. Its key contributions are: (1) a novel time-frequency interleaved band partitioning and compression mechanism that enforces structured sparsity in the frequency domain; (2) multi-scale selective attention and full-band frame-wise attention modules, jointly capturing fine-grained local patterns and global temporal dependencies; and (3) EchoSet, a high-fidelity reverberant noise dataset designed to enhance model robustness. Experiments demonstrate that TIGER reduces model parameters by 94.3% and multiply-accumulate operations (MACs) by 95.3%, while outperforming the state-of-the-art TF-GridNet in separation quality. Moreover, TIGER exhibits superior generalization on both EchoSet and real-world scenarios, confirming its effectiveness under practical deployment constraints.
Abstract
In recent years, much speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, high efficiency is equally important. Therefore, we propose a speech separation model with significantly reduced parameters and computational costs: Time-frequency Interleaved Gain Extraction and Reconstruction network (TIGER). TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We employ a multi-scale selective attention module to extract contextual features while introducing a full-frequency-frame attention module to capture both temporal and frequency contextual information. Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., considering object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experimental results show that models trained on EchoSet generalize better to data collected in the physical world than models trained on other datasets, which validates the practical value of EchoSet. On EchoSet and real-world data, TIGER reduces the number of parameters by 94.3% and the MACs by 95.3% while surpassing the performance of the state-of-the-art (SOTA) model TF-GridNet.
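The band division and compression step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual band layout or compression layer, so everything below is an assumption made for illustration: the helper names `split_bands` and `compress_bands`, the layout of narrow low-frequency bands and wider high-frequency bands, the real-valued magnitude input, and the random projection that stands in for a learned per-band layer.

```python
import numpy as np

def split_bands(n_bins, layout):
    """Return (start, stop) bin ranges for each band.

    layout is a list of (band_width, band_count) pairs (hypothetical values);
    any bins left over after the layout is exhausted form one final band.
    """
    edges, start = [], 0
    for width, count in layout:
        for _ in range(count):
            stop = min(start + width, n_bins)
            if stop > start:
                edges.append((start, stop))
            start = stop
    if start < n_bins:
        edges.append((start, n_bins))
    return edges

def compress_bands(spec, edges, dim, rng):
    """Map each band of a (n_bins, n_frames) spectrogram to `dim` features.

    A random matrix stands in for the learned projection a real model would use.
    """
    out = []
    for start, stop in edges:
        width = stop - start
        w = rng.standard_normal((dim, width)) / np.sqrt(width)
        out.append(w @ spec[start:stop])
    return np.stack(out)  # shape: (n_bands, dim, n_frames)

rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((257, 100)))  # stand-in magnitude spectrogram
edges = split_bands(257, [(4, 16), (16, 8)])    # assumed layout, not from the paper
feats = compress_bands(spec, edges, dim=8, rng=rng)
print(len(edges), feats.shape)
```

With this assumed layout, 257 frequency bins collapse into 25 bands of 8 features each, which shows how band-wise compression shrinks the frequency axis that later modules have to process.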