Online Audio-Visual Autoregressive Speaker Extraction

📅 2025-06-02

📈 Citations: 0

✨ Influential: 0

career value

213K/year

🤖 AI Summary

To address high latency and poor robustness during dynamic speaker switching in streaming multi-talker audio-visual speech separation, this paper proposes a low-latency online audio-visual joint separation framework. Methodologically, it introduces, for the first time, a lightweight depthwise separable convolutional visual frontend (0.1M parameters, 2.1 MACs/s) jointly optimized with an autoregressive acoustic encoder, forming an end-to-end online audio-visual fusion architecture. Key contributions include: (1) a visual frontend achieving state-of-the-art performance with significantly reduced computational cost; (2) autoregressive modeling improving SI-SNRi by 0.9 dB; and (3) the first systematic evaluation and enhancement of algorithmic robustness under attention-switching conditions. Experiments demonstrate that the method maintains millisecond-level latency while substantially improving separation quality and stability in dynamic target-speaker scenarios.

Technology Category

Application Category

📝 Abstract

This paper proposes a novel online audio-visual speaker extraction model. In the streaming regime, most studies optimize the audio network only, leaving the visual frontend less explored. We first propose a lightweight visual frontend based on depth-wise separable convolution. Then, we propose a lightweight autoregressive acoustic encoder to serve as the second cue, to actively explore the information in the separated speech signal from past steps. Scenario-wise, for the first time, we study how the algorithm performs when there is a change in focus of attention, i.e., the target speaker. Experimental results on LRS3 datasets show that our visual frontend performs comparably to the previous state-of-the-art on both SkiM and ConvTasNet audio backbones with only 0.1 million network parameters and 2.1 MACs per second of processing. The autoregressive acoustic encoder provides an additional 0.9 dB gain in terms of SI-SNRi, and its momentum is robust against the change in attention.

Problem

Research questions and friction points this paper is trying to address.

Optimizing lightweight visual frontend for online speaker extraction

Exploring autoregressive acoustic encoder for past speech signal utilization

Evaluating algorithm performance during target speaker attention shifts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight visual frontend with depth-wise convolution

Autoregressive acoustic encoder for past speech signals

Robust performance during focus of attention changes

🔎 Similar Papers

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention