🤖 AI Summary
To address high latency and poor robustness during dynamic speaker switching in streaming multi-talker audio-visual speech separation, this paper proposes a low-latency online audio-visual joint separation framework. Methodologically, it introduces, for the first time, a lightweight depthwise separable convolutional visual frontend (0.1M parameters, 2.1 MACs/s) jointly optimized with an autoregressive acoustic encoder, forming an end-to-end online audio-visual fusion architecture. Key contributions include: (1) a visual frontend achieving state-of-the-art performance with significantly reduced computational cost; (2) autoregressive modeling improving SI-SNRi by 0.9 dB; and (3) the first systematic evaluation and enhancement of algorithmic robustness under attention-switching conditions. Experiments demonstrate that the method maintains millisecond-level latency while substantially improving separation quality and stability in dynamic target-speaker scenarios.
📝 Abstract
This paper proposes a novel online audio-visual speaker extraction model. In the streaming regime, most studies optimize the audio network only, leaving the visual frontend less explored. We first propose a lightweight visual frontend based on depth-wise separable convolution. Then, we propose a lightweight autoregressive acoustic encoder to serve as the second cue, to actively explore the information in the separated speech signal from past steps. Scenario-wise, for the first time, we study how the algorithm performs when there is a change in focus of attention, i.e., the target speaker. Experimental results on LRS3 datasets show that our visual frontend performs comparably to the previous state-of-the-art on both SkiM and ConvTasNet audio backbones with only 0.1 million network parameters and 2.1 MACs per second of processing. The autoregressive acoustic encoder provides an additional 0.9 dB gain in terms of SI-SNRi, and its momentum is robust against the change in attention.