Online Audio-Visual Autoregressive Speaker Extraction

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high latency and poor robustness during dynamic speaker switching in streaming multi-talker audio-visual speech separation, this paper proposes a low-latency online audio-visual speaker extraction framework. Methodologically, it introduces a lightweight depth-wise separable convolutional visual frontend (0.1 M parameters, 2.1 MACs per second) jointly optimized with an autoregressive acoustic encoder, forming an end-to-end online audio-visual fusion architecture. Key contributions include: (1) a visual frontend that performs comparably to the previous state of the art at a fraction of the computational cost; (2) autoregressive modeling that improves SI-SNRi by 0.9 dB; and (3) the first systematic evaluation of robustness under attention switching, i.e., a change of target speaker. Experiments show that the method maintains low latency while substantially improving separation quality and stability in dynamic target-speaker scenarios.

📝 Abstract
This paper proposes a novel online audio-visual speaker extraction model. In the streaming regime, most studies optimize only the audio network, leaving the visual frontend less explored. We first propose a lightweight visual frontend based on depth-wise separable convolution. We then propose a lightweight autoregressive acoustic encoder to serve as a second cue, actively exploiting the information in the separated speech signal from past steps. Scenario-wise, for the first time, we study how the algorithm performs when there is a change in the focus of attention, i.e., the target speaker. Experimental results on the LRS3 dataset show that our visual frontend performs comparably to the previous state of the art on both SkiM and ConvTasNet audio backbones with only 0.1 million network parameters and 2.1 MACs per second of processing. The autoregressive acoustic encoder provides an additional 0.9 dB gain in terms of SI-SNRi, and its momentum is robust against the change in attention.
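To see why a depth-wise separable frontend can be so small, it helps to compare parameter counts directly. The sketch below uses illustrative channel and kernel sizes, not the paper's actual configuration:

```python
# Parameter count: standard 2-D convolution vs. a depth-wise separable
# one (depth-wise conv + point-wise 1x1 conv), the factorization behind
# lightweight visual frontends. Sizes here are hypothetical examples.

def conv2d_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depth-wise k x k conv (one filter per input channel)
    followed by a 1 x 1 point-wise conv mixing channels."""
    return c_in * k * k + c_in * c_out

standard = conv2d_params(64, 64, 3)                # 36,864 weights
separable = depthwise_separable_params(64, 64, 3)  # 4,672 weights
print(standard, separable, round(standard / separable, 1))  # 36864 4672 7.9
```

The same factorization applied layer by layer is what makes a 0.1 M-parameter visual frontend plausible.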
Problem

Research questions and friction points this paper is trying to address.

Optimizing lightweight visual frontend for online speaker extraction
Exploring autoregressive acoustic encoder for past speech signal utilization
Evaluating algorithm performance during target speaker attention shifts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight visual frontend based on depth-wise separable convolution
Autoregressive acoustic encoder exploiting separated speech from past steps
Robust performance during focus of attention changes
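The autoregressive idea can be sketched as a streaming loop in which the extractor conditions on both the visual cue and its own separated output from the previous chunk. Everything below (function names, chunk length, the placeholder weighted sum) is hypothetical scaffolding, not the paper's model:

```python
import numpy as np

CHUNK = 160  # hypothetical chunk length in samples

def extract_chunk(mix_chunk, visual_cue, past_output):
    """Stand-in for the separator: conditions on the visual cue and on
    the previously separated chunk (the autoregressive cue). A real
    model would be a neural network; this is a placeholder weighted sum."""
    return 0.8 * mix_chunk + 0.1 * visual_cue + 0.1 * past_output

def streaming_extract(mixture, visual_cues):
    past = np.zeros(CHUNK)  # no separated history before the first chunk
    out = []
    for i in range(0, len(mixture), CHUNK):
        chunk = extract_chunk(mixture[i:i + CHUNK],
                              visual_cues[i:i + CHUNK], past)
        past = chunk         # feed separated speech back as the second cue
        out.append(chunk)
    return np.concatenate(out)

mix = np.random.randn(4 * CHUNK)
cues = np.random.randn(4 * CHUNK)
est = streaming_extract(mix, cues)
print(est.shape)  # (640,)
```

Because the recycled past output carries momentum from the previously tracked speaker, this feedback path is also the plausible source of the robustness to attention switches reported above.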
Zexu Pan
Tongyi Lab, Alibaba Group, Singapore
Wupeng Wang
National University of Singapore, Singapore
Shengkui Zhao
Senior Algorithm Expert, Alibaba Group (speech processing and large models)
Chong Zhang
Tongyi Lab, Alibaba Group, Singapore
Kun Zhou
Tongyi Lab, Alibaba Group, Singapore
Yukun Ma
Alibaba Group (ASR, SLU)
Bin Ma
Tongyi Lab, Alibaba Group, Singapore