LightAVSeg: Lightweight Audio-Visual Segmentation

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of deploying audio-visual segmentation models on resource-constrained devices, where existing methods rely on computationally intensive cross-modal attention. The authors propose a lightweight framework that decouples cross-modal interaction into two stages, semantic filtering and spatial localization, each with cost linear in spatial resolution, in place of conventional attention. They also introduce an auxiliary alignment loss that enforces semantic consistency during training and adds no inference overhead. The resulting model has 20.5 million parameters (approximately one-seventh the size of AVSegFormer) and achieves a competitive 50.4 mIoU on the MS3 benchmark while enabling efficient mobile inference.
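To make the quadratic versus linear distinction concrete, here is a rough cost sketch. It is a simplified model, not a measurement from the paper: it assumes the quadratic term comes from scoring every pair of the N = H·W spatial positions, ignores channel width and constant factors, and uses hypothetical function names.

```python
def dense_attention_pairs(height: int, width: int) -> int:
    """Position pairs scored when every pixel attends to every pixel (assumed cost model)."""
    n = height * width
    return n * n


def linear_interaction_ops(height: int, width: int) -> int:
    """Per-position operations when each pixel is visited once (assumed cost model)."""
    return height * width


# Feature-map sizes chosen only for illustration.
for side in (28, 56, 112):
    dense = dense_attention_pairs(side, side)
    linear = linear_interaction_ops(side, side)
    print(f"{side}x{side}: dense={dense:,} linear={linear:,} ratio={dense // linear:,}")
```

Under this model, doubling the side length of the feature map multiplies the linear cost by 4 and the dense cost by 16, which is why the interaction module can dominate as resolution grows.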

📝 Abstract

Audio-Visual Segmentation (AVS) targets pixel-level localization of sound-emitting objects in videos. However, existing models rely on dense cross-modal attention with quadratic computational cost, limiting their suitability for resource-efficient deployment. Most efficiency-oriented methods focus on backbone reduction and overlook the interaction module as the primary bottleneck. This paper proposes LightAVSeg, a lightweight framework that replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution. Furthermore, we introduce an auxiliary alignment loss to enforce semantic consistency during training with zero inference overhead. Extensive experiments demonstrate that LightAVSeg achieves a new state-of-the-art among lightweight methods: with 20.5M parameters (~1/7 of AVSegFormer), it reaches 50.4 mIoU on the MS3 benchmark and enables efficient inference on a mobile processor.
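The abstract does not spell out the two stages, so the following is only a minimal sketch of what a decoupled, linear-cost interaction could look like. Everything in it is an assumption for illustration: the channel-gating form of the filtering stage, the dot-product form of the grounding stage, and all function names are hypothetical and should not be read as LightAVSeg's actual architecture.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def semantic_filter(pixels, audio):
    """Stage 1 (assumed form): scale each visual channel by an audio-derived gate."""
    gates = [sigmoid(a) for a in audio]
    return [[v * g for v, g in zip(px, gates)] for px in pixels]


def spatial_grounding(pixels, audio):
    """Stage 2 (assumed form): score each pixel by its dot product with the audio vector."""
    return [sigmoid(sum(v * a for v, a in zip(px, audio))) for px in pixels]


def decoupled_interaction(pixels, audio):
    """Run both stages; each visits every pixel once, so cost grows linearly with H*W."""
    filtered = semantic_filter(pixels, audio)
    return spatial_grounding(filtered, audio)


# Toy input: 3 flattened pixels with 2 channels each, and a 2-channel audio embedding.
pixels = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
audio = [2.0, -2.0]
heat = decoupled_interaction(pixels, audio)
print([round(h, 3) for h in heat])
```

In this toy setup the first pixel, whose features align with the audio embedding, receives the highest score. No pixel-to-pixel comparison is made at any point, which is the property that keeps the cost linear in the number of spatial positions.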

Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Segmentation

Lightweight Model

Cross-modal Attention

Computational Efficiency

Interaction Bottleneck

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight AVS

Decoupled Cross-modal Interaction

Linear Complexity