TF-MLPNet: Tiny Real-Time Neural Speech Separation

📅 2025-08-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of deploying real-time speech separation models on ultra-low-power neural accelerators (e.g., GAP9) in hearable devices, this work proposes the first lightweight, real-time speech separation network tailored for micro-edge platforms. Methodologically, it adopts time-frequency decoupled modeling: frequency-domain interactions across channels and frequency bins are captured via fully connected layers, while temporal dynamics per frequency bin are modeled independently using convolutional layers; mixed-precision quantization-aware training (QAT) is further employed for efficient model compression. Experiments demonstrate true real-time inference on GAP9 with a 6 ms frame length, achieving a 3.5–4× runtime improvement over prior state-of-the-art methods. The network also attains superior performance on both blind source separation and target speech extraction tasks. This work delivers a practical, energy-efficient neural speech enhancement solution for low-power hearables.
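The time-frequency decoupled modeling described above can be sketched in a few lines. This is only an illustration of the idea, assuming small made-up shapes and random weights: the `freq_mlp` and `causal_conv_per_bin` helpers, the layer sizes, and the activation choice are hypothetical, not the paper's actual architecture or code.

```python
import numpy as np

rng = np.random.default_rng(0)
C, F, T, K = 8, 16, 10, 3  # channels, frequency bins, time frames, kernel size (example values)

def freq_mlp(x, w_ch, w_freq):
    """Fully connected layers alternating along the channel and frequency dims.
    x: (C, F) features for a single time frame."""
    h = np.tanh(w_ch @ x)    # mix across channels:        (C, C) @ (C, F) -> (C, F)
    h = np.tanh(h @ w_freq)  # mix across frequency bins:  (C, F) @ (F, F) -> (C, F)
    return h

def causal_conv_per_bin(seq, kernels):
    """Independent causal 1-D convolution over time at each frequency bin.
    seq: (C, F, T); kernels: (C, F, K) -- one small kernel per (channel, bin)."""
    C, F, T = seq.shape
    K = kernels.shape[-1]
    padded = np.concatenate([np.zeros((C, F, K - 1)), seq], axis=-1)  # left-pad: causal
    out = np.zeros_like(seq)
    for t in range(T):
        # each output frame depends only on the current and K-1 past frames
        out[..., t] = np.sum(padded[..., t:t + K] * kernels, axis=-1)
    return out

x = rng.standard_normal((C, F, T))
w_ch, w_freq = rng.standard_normal((C, C)), rng.standard_normal((F, F))
kernels = rng.standard_normal((C, F, K))

# frequency-axis MLP applied frame by frame, then per-bin temporal convolution
h = np.stack([freq_mlp(x[..., t], w_ch, w_freq) for t in range(T)], axis=-1)
y = causal_conv_per_bin(h, kernels)
print(y.shape)  # (8, 16, 10)
```

The causal left-padding is what makes the temporal path streamable: each 6 ms chunk can be processed as it arrives, since no output frame looks at future frames.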

๐Ÿ“ Abstract
Speech separation on hearable devices can enable transformative augmented and enhanced hearing capabilities. However, state-of-the-art speech separation networks cannot run in real-time on tiny, low-power neural accelerators designed for hearables, due to their limited compute capabilities. We present TF-MLPNet, the first speech separation network capable of running in real-time on such low-power accelerators while outperforming existing streaming models for blind speech separation and target speech extraction. Our network operates in the time-frequency domain, processing frequency sequences with stacks of fully connected layers that alternate along the channel and frequency dimensions, and independently processing the time sequence at each frequency bin using convolutional layers. Results show that our mixed-precision quantization-aware trained (QAT) model can process 6 ms audio chunks in real-time on the GAP9 processor, achieving a 3.5-4x runtime reduction compared to prior speech separation models.
Problem

Research questions and friction points this paper is trying to address.

Real-time speech separation on low-power hearable devices
Overcoming compute limitations of tiny neural accelerators
Achieving faster processing than prior speech separation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time speech separation on low-power accelerators
Time-frequency domain processing with MLP stacks
Mixed-precision quantization-aware training for efficiency
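On the last point: quantization-aware training inserts a "fake quantize" step into the forward pass so the model learns to tolerate low-bit weights, and a mixed-precision scheme assigns different bit-widths to different layers. The paper's actual precision assignment is not given here; below is a minimal sketch of symmetric uniform fake quantization, with the 8-bit/4-bit split as an assumed example.

```python
import numpy as np

def fake_quant(w, bits):
    """Symmetric uniform fake quantization: quantize weights to `bits` and
    dequantize back, so the forward pass sees the quantization error (QAT)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.round(w / scale).clip(-qmax, qmax) * scale

w = np.linspace(-1.0, 1.0, 9)
w8 = fake_quant(w, 8)  # e.g., sensitive layers kept at higher precision
w4 = fake_quant(w, 4)  # e.g., lighter layers pushed to 4-bit (example split)
```

During deployment the dequantize step is dropped and only the integer weights and per-tensor scales are shipped, which is what lets the model fit the accelerator's integer compute units.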