🤖 AI Summary
This work addresses real-time single-microphone speech enhancement for UAVs under severe self-noise and stringent resource constraints. We propose a lightweight band-fusion attention network that integrates a frequency-domain Transformer with subband encoding. Key methodological contributions include a learnable gated fusion mechanism, a hybrid full-band/subband encoder-decoder architecture, a temporal convolutional network (TCN) backend, and a joint spectral-temporal loss function, which together enable low-latency streaming inference. Experiments on VoiceBank-DEMAND and realistic UAV noise datasets demonstrate robust spectral reconstruction at extremely low SNRs, with PESQ improving by more than 1.2 points. The model reduces computational complexity and memory footprint by 47% and 53%, respectively, while maintaining real-time performance, meeting the strict deployment requirements of onboard UAV platforms.
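The learnable gated fusion mentioned above can be illustrated with a minimal sketch: a sigmoid gate, computed from the concatenated encoder skip and decoder features, blends the two streams element-wise. The shapes, weight layout, and random placeholder parameters here are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_skip_fusion(enc_feat, dec_feat, w, b):
    """Blend encoder skip features with decoder features via a learned gate.

    enc_feat, dec_feat: (frames, channels) feature maps.
    w: (2*channels, channels) gate projection, b: (channels,) bias.
    In a real model w and b are trained; here they are placeholders.
    """
    z = np.concatenate([enc_feat, dec_feat], axis=-1)  # (frames, 2C)
    g = sigmoid(z @ w + b)                             # per-element gate in (0, 1)
    return g * enc_feat + (1.0 - g) * dec_feat         # convex combination

rng = np.random.default_rng(0)
enc = rng.standard_normal((10, 16))
dec = rng.standard_normal((10, 16))
w = 0.1 * rng.standard_normal((32, 16))
b = np.zeros(16)
fused = gated_skip_fusion(enc, dec, w, b)
print(fused.shape)  # (10, 16)
```

Because the gate output lies in (0, 1), every fused value is a convex blend of the corresponding encoder and decoder activations, which keeps the skip path stable during training.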
📝 Abstract
This paper proposes DroFiT (Drone Frequency lightweight Transformer for speech enhancement), a single-microphone speech enhancement network for severe drone self-noise. DroFiT integrates a frequency-wise Transformer with a full-/sub-band hybrid encoder-decoder and a TCN back-end for memory-efficient streaming. A learnable skip-and-gate fusion with a combined spectral-temporal loss further refines reconstruction. The model is trained on VoiceBank-DEMAND mixed with recorded drone noise (-5 to -25 dB SNR) and evaluated using standard speech enhancement metrics and computational efficiency measures. Experimental results show that DroFiT achieves competitive enhancement performance while significantly reducing computational and memory demands, paving the way for real-time processing on resource-constrained UAV platforms. Audio demo samples are available on our demo page.
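The combined spectral-temporal loss can be sketched as a weighted sum of an STFT-magnitude L1 term and a waveform L1 term. The weight `alpha`, the Hann-windowed STFT setup, and the frame sizes below are illustrative assumptions, not the paper's reported training configuration.

```python
import numpy as np

def spectral_temporal_loss(est, ref, alpha=0.5, n_fft=256, hop=128):
    """Hypothetical joint loss: alpha * spectral-magnitude L1 + (1 - alpha) * waveform L1.

    est, ref: 1-D waveforms of equal length.
    alpha, n_fft, hop are assumed values for illustration.
    """
    def stft_mag(x):
        # Simple framed magnitude STFT with a Hann window.
        win = np.hanning(n_fft)
        frames = [x[i:i + n_fft] * win
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

    spec_term = np.mean(np.abs(stft_mag(est) - stft_mag(ref)))
    time_term = np.mean(np.abs(est - ref))
    return alpha * spec_term + (1.0 - alpha) * time_term

rng = np.random.default_rng(1)
ref = rng.standard_normal(4000)
est = ref + 0.1 * rng.standard_normal(4000)
print(spectral_temporal_loss(ref, ref))  # 0.0 for identical signals
loss = spectral_temporal_loss(est, ref)
```

Coupling a frequency-domain term with a time-domain term is a common way to penalize both magnitude-spectrum errors and waveform-level artifacts at once.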