🤖 AI Summary
This work addresses the challenge of efficiently tracking a moving speaker to maintain high-quality speech enhancement in dynamic acoustic environments where only the speaker’s initial direction is known. The authors propose a lightweight Bayesian tracking framework that is compatible with any deep spatially selective filter and incorporates an autoregressive feedback mechanism: enhanced speech from the previous frame guides the spatial filtering of the current frame, enabling causal, real-time, and accurate joint tracking and enhancement. To improve simulation realism, they also construct and publicly release a speaker trajectory dataset based on the social force model. The method achieves significant improvements in both tracking accuracy and speech enhancement performance with negligible additional computational cost, demonstrating strong generalization across both simulated and real-world complex acoustic scenarios.
📝 Abstract
Deep spatially selective filters achieve high-quality enhancement with real-time-capable architectures for stationary speakers of known directions. To retain this level of performance in dynamic scenarios where only the speakers' initial directions are given, accurate yet computationally lightweight tracking algorithms become necessary. Assuming a frame-wise causal processing style, temporal feedback allows the enhanced speech signal to be leveraged to improve tracking performance. In this work, we investigate strategies to incorporate the enhanced signal into lightweight tracking algorithms and to autoregressively guide deep spatial filters. Our proposed Bayesian tracking algorithms are compatible with arbitrary deep spatial filters. To increase the realism of simulated trajectories during development and evaluation, we propose and publish a novel dataset based on the social force model. Results validate that the autoregressive incorporation significantly improves the accuracy of our Bayesian trackers, resulting in superior enhancement at no or only negligible additional computational overhead. Real-world recordings complement these findings and demonstrate the generalizability of our methods to unseen, challenging acoustic conditions.
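The causal, autoregressive loop described above can be sketched as a grid-based Bayesian tracker: a posterior over candidate directions is diffused each frame (to allow speaker motion) and updated with a likelihood that, in the paper's setting, would be derived from the previous frame's enhanced output. The sketch below is illustrative only, under assumed names: `transition`, `update`, and the peaked likelihood stand in for the paper's actual observation model and deep spatial filter.

```python
import numpy as np

# Candidate directions of arrival (DOAs) on a coarse azimuth grid.
ANGLES = np.linspace(0.0, 2 * np.pi, 36, endpoint=False)

def transition(posterior, sigma=0.3):
    """Predict step: diffuse the posterior with a wrapped Gaussian
    kernel so the tracker can follow a moving speaker."""
    diffs = ANGLES[:, None] - ANGLES[None, :]
    wrapped = np.angle(np.exp(1j * diffs))          # wrap to [-pi, pi]
    kernel = np.exp(-0.5 * (wrapped / sigma) ** 2)
    kernel /= kernel.sum(axis=1, keepdims=True)
    pred = kernel @ posterior
    return pred / pred.sum()

def update(prior, likelihood):
    """Bayes update with a per-angle likelihood vector. In the
    autoregressive variant, this likelihood would be computed from the
    previous frame's *enhanced* signal rather than the raw mixture."""
    post = prior * likelihood
    return post / post.sum()

# Toy frame loop: the likelihood is peaked at the (here, known) true
# direction, mimicking an informative autoregressive observation.
posterior = np.ones(len(ANGLES)) / len(ANGLES)      # uninformative init
true_idx = 5
for _ in range(10):
    posterior = transition(posterior)
    err = np.angle(np.exp(1j * (ANGLES - ANGLES[true_idx])))
    likelihood = np.exp(-0.5 * (err / 0.2) ** 2)    # placeholder model
    posterior = update(posterior, likelihood)
    doa_estimate = ANGLES[np.argmax(posterior)]     # MAP direction
```

The MAP direction `doa_estimate` would then steer the deep spatial filter for the current frame, whose enhanced output closes the feedback loop at the next frame.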