🤖 AI Summary
This work addresses real-time multi-channel speech enhancement under unconstrained microphone arrays, where the number and geometric configuration of microphones are variable. We propose a lightweight attention-based beamforming network built on a three-stage feature extraction and fusion framework, incorporating cross-channel attention to achieve, for the first time, microphone-invariant modeling at extremely low computational complexity. The method enables adaptive feature aggregation across arbitrary numbers and configurations of microphones. By integrating time-frequency domain beamforming with a compact neural architecture, it achieves real-time inference on edge devices. Experiments demonstrate that the model achieves superior speech quality and intelligibility over existing lightweight approaches while keeping the parameter count under 1M and the computational cost under 0.5 GFLOPs. Crucially, it maintains robust performance across diverse array geometries, establishing a new paradigm for on-device speech enhancement in unconstrained array scenarios.
📄 Abstract
Multichannel speech enhancement (SE) aims to restore clean speech from noisy measurements by leveraging spatiotemporal signal features. In ad-hoc array conditions, microphone invariance (MI) requires systems to handle varying microphone numbers and array geometries. From a practical perspective, multichannel recordings inevitably increase the computational burden for edge-device applications, highlighting the need for lightweight and efficient deployment. In this work, we propose a lightweight attentive beamforming network (LABNet) to integrate MI into a low-complexity real-time SE system. We design a three-stage framework for efficient intra-channel modeling and inter-channel interaction. A cross-channel attention module is developed to selectively aggregate features from each channel. Experimental results demonstrate that LABNet achieves impressive performance with ultra-light resource overhead while maintaining MI, indicating great potential for ad-hoc array processing.
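To illustrate why attention-based aggregation yields microphone invariance, the sketch below shows a minimal cross-channel attention step in NumPy. This is not the paper's LABNet implementation; the reference-channel query, the single-head formulation, and all weight names (`Wq`, `Wk`, `Wv`) are illustrative assumptions. The key property it demonstrates is that the same fixed parameters handle any number of channels, because attention weights are computed over the channel axis rather than being tied to a fixed array size.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_channel_attention(feats, Wq, Wk, Wv):
    """Aggregate per-channel features into one stream via attention.

    feats: (C, T, D) features for C microphones, T frames, D dims.
    C may vary freely; Wq/Wk/Wv are (D, D) and independent of C.
    Returns: (T, D) fused features.
    """
    C, T, D = feats.shape
    q = feats[0] @ Wq            # query from a reference channel, (T, D)
    k = feats @ Wk               # keys per channel, (C, T, D)
    v = feats @ Wv               # values per channel, (C, T, D)
    # per-frame attention scores over the channel axis
    scores = np.einsum('td,ctd->tc', q, k) / np.sqrt(D)   # (T, C)
    weights = softmax(scores, axis=-1)                    # (T, C)
    return np.einsum('tc,ctd->td', weights, v)            # (T, D)

rng = np.random.default_rng(0)
D = 8
Wq, Wk, Wv = (0.1 * rng.standard_normal((D, D)) for _ in range(3))
# the same parameters work for 2, 4, or 6 microphones
for C in (2, 4, 6):
    fused = cross_channel_attention(rng.standard_normal((C, 5, D)), Wq, Wk, Wv)
    assert fused.shape == (5, D)
```

Because the channel count only appears as a summation axis, the parameter count stays fixed regardless of the array, which is the property the MI requirement asks for.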