Moving Speaker Separation via Parallel Spectral-Spatial Processing

📅 2026-02-25

📈 Citations: 0

✨ Influential: 0

career value

223K/year

🤖 AI Summary

This work addresses the challenge of multi-channel speech separation in dynamic environments, where spectral and spatial features co-evolve across multiple time scales, a scenario poorly handled by conventional serial architectures. The authors propose a novel Parallel Spectral-Spatial (PS2) architecture featuring a dual-branch parallel design that explicitly decouples these two feature types for the first time. The spectral branch integrates BLSTM, Mamba, and self-attention mechanisms, while the spatial branch employs bidirectional GRUs to model source-microphone geometric relationships. Adaptive fusion is achieved through cross-attention between the branches. Evaluated under moving speaker conditions, the method significantly outperforms current state-of-the-art approaches, yielding improvements of 1.6–2.2 dB in SI-SDR. Notably, it maintains robust performance even under extreme conditions—including high reverberation, strong noise, and rapid motion—with gains exceeding 13 dB, demonstrating exceptional dynamic robustness.

Technology Category

Application Category

📝 Abstract

Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches are integrated through a cross-attention fusion mechanism that adaptively weights their contributions. Experimental results demonstrate that the PS2 outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) for moving speaker scenarios, with robust separation quality under different reverberation times (RT60), noise levels, and source movement speeds. Even with fast source movements, the proposed model maintains SI-SDR improvements of over 13 dB. These improvements are consistently observed across multiple datasets, including WHAMR! and our generated WSJ0-Demand-6ch-Move dataset.

Problem

Research questions and friction points this paper is trying to address.

moving speaker separation

multi-channel speech separation

dynamic environments

spectral-spatial features

modeling conflict

Innovation

Methods, ideas, or system contributions that make the work stand out.

parallel spectral-spatial processing

moving speaker separation

cross-attention fusion