TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation

📅 2024-10-02

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the high parameter count and computational cost of speech separation models in low-latency applications, this paper proposes TIGER, a lightweight and efficient architecture. Its key contributions are: (1) a frequency band division scheme that uses prior knowledge to split the spectrum into bands and compress frequency information; (2) a multi-scale selective attention module that extracts contextual features, together with a full-frequency-frame attention module that captures both temporal and frequency context; and (3) EchoSet, a dataset with noise and more realistic reverberation, built to evaluate separation models in complex acoustic environments. On EchoSet and real-world data, TIGER uses 94.3% fewer parameters and 95.3% fewer multiply-accumulate operations (MACs) than the state-of-the-art TF-GridNet while surpassing it in separation performance. Models trained on EchoSet also generalize better to data recorded in the physical world than models trained on other datasets.
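To make the reported reductions concrete, the short calculation below converts a percentage reduction into the fraction that remains and the corresponding shrink factor. Only the two percentages come from the paper; the helper name `shrink_factor` is introduced here for illustration.

```python
def shrink_factor(reduction_pct):
    """Return how many times smaller a quantity is after a given percentage reduction."""
    remaining = 1.0 - reduction_pct / 100.0  # fraction of the original that is left
    return 1.0 / remaining

# Reductions reported relative to TF-GridNet.
for name, pct in [("parameters", 94.3), ("MACs", 95.3)]:
    print(f"{name}: {100 - pct:.1f}% remaining, about {shrink_factor(pct):.1f}x smaller")
```

In other words, a 94.3% reduction leaves 5.7% of the parameters (roughly 17.5 times fewer), and a 95.3% reduction leaves 4.7% of the MACs (roughly 21.3 times fewer).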

📝 Abstract

In recent years, much speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, high efficiency is equally important. Therefore, we propose a speech separation model with significantly reduced parameters and computational costs: Time-frequency Interleaved Gain Extraction and Reconstruction network (TIGER). TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We employ a multi-scale selective attention module to extract contextual features while introducing a full-frequency-frame attention module to capture both temporal and frequency contextual information. Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., considering object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experimental results show that, when evaluated on data collected in the physical world, models trained on EchoSet generalize better than those trained on other datasets, which validates the practical value of EchoSet. On EchoSet and real-world data, TIGER reduces the number of parameters by 94.3% and the MACs by 95.3% while achieving performance surpassing the state-of-the-art (SOTA) model TF-GridNet.
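The idea of dividing the spectrum into bands and compressing frequency information can be illustrated with a minimal NumPy sketch. It is not TIGER's implementation: the band widths, the feature size `dim`, and the random untrained projection are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows how a 64-bin frequency axis can be shortened to 16 band-level units.

```python
import numpy as np

def split_bands(spec, band_widths):
    """Split a (freq, time) spectrogram into consecutive sub-bands."""
    assert sum(band_widths) == spec.shape[0], "band widths must cover every bin"
    edges = np.cumsum([0] + list(band_widths))
    return [spec[lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]

def compress_bands(bands, dim, rng):
    """Map each band to `dim` features per frame with a random linear projection.

    A trained model would learn these projections; random weights are a stand-in.
    """
    out = []
    for band in bands:
        w = rng.standard_normal((dim, band.shape[0])) / np.sqrt(band.shape[0])
        out.append(w @ band)  # (dim, time)
    return np.stack(out)      # (num_bands, dim, time)

rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((64, 100)))  # toy magnitude spectrogram: 64 bins, 100 frames
band_widths = [2] * 8 + [4] * 4 + [8] * 4      # hypothetical: finer bands at low frequencies
bands = split_bands(spec, band_widths)
features = compress_bands(bands, dim=16, rng=rng)
print(len(bands), features.shape)
```

Using narrower bands at low frequencies is one common form of prior knowledge in band-split designs; whether TIGER uses this exact layout is not stated in the abstract.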

Problem

Research questions and friction points this paper is trying to address.

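Speech separation research has focused mainly on performance, while low-latency systems also need models with few parameters and low computational cost.

Evaluating separation models realistically in complex acoustic environments, which motivates the EchoSet dataset with noise and more realistic reverberation.

Surpassing the state-of-the-art TF-GridNet in separation performance with far fewer parameters and MACs.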

Innovation

Methods, ideas, or system contributions that make the work stand out.

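TIGER divides frequency bands using prior knowledge and compresses frequency information, cutting parameters by 94.3% and MACs by 95.3% relative to TF-GridNet.

A multi-scale selective attention module extracts contextual features, and a full-frequency-frame attention module captures temporal and frequency context.

Models trained on EchoSet generalize better to real-world recordings than models trained on other datasets.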
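The abstract says EchoSet mixes two speakers "overlapping at random proportions". A minimal sketch of that single step is shown below. The function `mix_with_overlap`, the choice to overlap the tail of the first utterance, and the uniform sampling of the ratio are assumptions for illustration; EchoSet's actual pipeline also adds noise and simulated reverberation, which are omitted here.

```python
import numpy as np

def mix_with_overlap(s1, s2, overlap_ratio):
    """Overlap the start of s2 with the tail of s1.

    `overlap_ratio` is the overlapped fraction of the shorter utterance (0 to 1).
    Returns the mixture and the sample offset at which s2 starts.
    """
    overlap = int(round(overlap_ratio * min(len(s1), len(s2))))
    offset = len(s1) - overlap
    total = max(len(s1), offset + len(s2))
    mix = np.zeros(total)
    mix[:len(s1)] += s1
    mix[offset:offset + len(s2)] += s2
    return mix, offset

rng = np.random.default_rng(0)
s1 = rng.standard_normal(16000)  # stand-in for one second of speech at 16 kHz
s2 = rng.standard_normal(16000)
ratio = rng.uniform(0.0, 1.0)    # random overlap proportion
mixture, start = mix_with_overlap(s1, s2, ratio)
print(f"overlap ratio {ratio:.2f}, mixture length {len(mixture)} samples")
```

A ratio of 0 places the utterances back to back, and a ratio of 1 overlaps them fully.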
