Two-stage Audio-Visual Target Speaker Extraction System for Real-Time Processing On Edge Device

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing audio-visual target speaker extraction (AVTSE) methods suffer from high computational complexity, hindering real-time deployment on edge devices. This paper proposes a decoupled two-stage lightweight framework: Stage I employs a lightweight CNN-based visual module for voice activity detection (VAD); Stage II fuses the VAD outputs with acoustic features to estimate time-frequency masks for target speech separation. Crucially, visual guidance is disentangled from end-to-end joint modeling, substantially reducing computational overhead. The method further incorporates cross-modal temporal alignment and a low-complexity acoustic encoder. Evaluated on CHiME-5 and LRS3, it achieves a 12.3 dB SI-SDR improvement with only 1.2M parameters. On an ARM Cortex-A76 platform, inference latency is under 30 ms and power consumption is reduced by 76%, marking the first real-time-feasible AVTSE implementation on resource-constrained edge hardware.

📝 Abstract
Audio-Visual Target Speaker Extraction (AVTSE) aims to isolate a target speaker's voice in a multi-speaker environment using visual cues as auxiliary information. Most existing AVTSE methods encode visual and audio features simultaneously, resulting in extremely high computational complexity and making them impractical for real-time processing on edge devices. To tackle this issue, we propose a two-stage ultra-compact AVTSE system. Specifically, in the first stage, a compact network performs voice activity detection (VAD) using visual information. In the second stage, the VAD results are combined with audio inputs to isolate the target speaker's voice. Experiments show that the proposed system effectively suppresses background noise and interfering voices while consuming few computational resources.
Problem

Research questions and friction points this paper is trying to address.

Isolating a target speaker's voice in multi-speaker environments
Reducing computational complexity for real-time processing on edge devices
Combining visual and audio cues efficiently for speaker extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage ultra-compact AVTSE system
Visual-based VAD as a lightweight first stage
Fusion of VAD results with audio inputs for target speech extraction
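The two-stage idea above can be sketched in a few lines. This is a minimal toy illustration, not the paper's method: the `visual_vad` heuristic (frame-difference "lip motion energy") stands in for the paper's lightweight CNN, and `masked_extraction` replaces the learned time-frequency mask with a simple VAD gate broadcast across frequency. All function names, thresholds, and shapes here are hypothetical.

```python
import numpy as np

def visual_vad(lip_frames, threshold=0.5):
    """Stage I (toy stand-in): per-frame voice activity from visual input.

    lip_frames: array of shape (time, height, width). Frame-to-frame
    pixel differences serve as a crude proxy for lip movement; the
    paper instead uses a compact CNN on the visual stream.
    """
    # Difference of each frame against the previous one (first frame -> 0).
    motion = np.abs(np.diff(lip_frames, axis=0, prepend=lip_frames[:1]))
    energy = motion.reshape(len(lip_frames), -1).mean(axis=1)
    # Normalize to [0, 1] and threshold into a binary VAD sequence.
    energy = energy / (energy.max() + 1e-8)
    return (energy > threshold).astype(np.float32)

def masked_extraction(mixture_spec, vad, floor=0.05):
    """Stage II (toy stand-in): gate the mixture spectrogram with the VAD.

    mixture_spec: magnitude spectrogram of shape (freq, time).
    The paper estimates a learned time-frequency mask from fused
    VAD and acoustic features; here the mask is just the VAD
    broadcast over frequency, with a small spectral floor.
    """
    mask = np.clip(vad[None, :], floor, 1.0)  # (1, time) -> broadcast to (freq, time)
    return mixture_spec * mask
```

The decoupling is the point: Stage I runs only on cheap visual features, so the audio-side model in Stage II can stay small because it receives an explicit activity cue instead of learning the audio-visual correspondence end to end.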