Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual segmentation methods struggle to adapt to dynamically shifting data distributions under replay-free conditions due to a lack of continual learning capabilities. This work establishes the first replay-free continual learning benchmark tailored for audio-visual segmentation, encompassing four learning protocols across both single-source and multi-source scenarios, and introduces ATLAS, a novel baseline method. ATLAS integrates audio-guided pre-fusion channel modulation, cross-modal attention mechanisms, and a loss-sensitivity-aware low-rank weight anchoring strategy to effectively mitigate catastrophic forgetting. Extensive experiments demonstrate that ATLAS achieves superior performance across diverse continual learning settings, laying a solid foundation for lifelong multimodal perception.
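The summary's "audio-guided pre-fusion channel modulation" is described only at a high level; the paper's exact formulation is not given here. A minimal NumPy sketch of one plausible reading (FiLM-style conditioning, with all function and variable names hypothetical): the audio embedding is projected to per-channel scale and shift terms that modulate the visual feature map before cross-modal attention.

```python
import numpy as np

rng = np.random.default_rng(0)

def modulate_visual(visual, audio, W_scale, W_shift):
    """FiLM-style channel modulation (illustrative, not the paper's code):
    project the audio context to per-channel scale/shift and apply it to
    the visual features before cross-modal attention.
    Shapes: visual (C, H, W), audio (D,), W_scale/W_shift (C, D)."""
    scale = 1.0 + np.tanh(W_scale @ audio)   # (C,), centered at 1
    shift = W_shift @ audio                  # (C,)
    return visual * scale[:, None, None] + shift[:, None, None]

C, D, H, W = 8, 4, 16, 16
visual = rng.standard_normal((C, H, W))
audio = rng.standard_normal(D)
W_scale = rng.standard_normal((C, D)) * 0.1
W_shift = rng.standard_normal((C, D)) * 0.1

out = modulate_visual(visual, audio, W_scale, W_shift)
print(out.shape)  # (8, 16, 16)
```

Note the identity-at-zero design: with a silent (zero) audio context the scale is 1 and the shift is 0, so the visual features pass through unchanged.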

📝 Abstract
Audio-Visual Segmentation (AVS) aims to produce pixel-level masks of sound-producing objects in videos by jointly learning from audio and visual signals. However, real-world environments are inherently dynamic, causing audio and visual distributions to evolve over time, which challenges existing AVS systems that assume static training settings. To address this gap, we introduce the first exemplar-free continual learning benchmark for Audio-Visual Segmentation, comprising four learning protocols across single-source and multi-source AVS datasets. We further propose a strong baseline, ATLAS, which uses audio-guided pre-fusion conditioning to modulate visual feature channels via projected audio context before cross-modal attention. Finally, we mitigate catastrophic forgetting by introducing Low-Rank Anchoring (LRA), which stabilizes adapted weights based on loss sensitivity. Extensive experiments demonstrate competitive performance across diverse continual scenarios, establishing a foundation for lifelong audio-visual perception. Code is available at https://gitlab.com/viper-purdue/atlas (paper under review). Keywords: Continual Learning · Audio-Visual Segmentation · Multi-Modal Learning
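The abstract says LRA "stabilizes adapted weights based on loss sensitivity" but gives no formula. A minimal NumPy sketch of one plausible reading (an EWC-style quadratic penalty on low-rank adapter factors, weighted by per-parameter sensitivity scores; all names are hypothetical, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def lra_penalty(A, B, A_anchor, B_anchor, sens_A, sens_B, lam=1.0):
    """Loss-sensitivity-weighted anchoring on low-rank factors A (d, r)
    and B (r, d). sens_* are nonnegative importance scores (e.g. running
    averages of squared gradients); high-sensitivity entries are pulled
    harder toward the values anchored after the previous task."""
    return lam * (np.sum(sens_A * (A - A_anchor) ** 2)
                  + np.sum(sens_B * (B - B_anchor) ** 2))

d, r = 16, 4  # full dimension, low rank
A_anchor = rng.standard_normal((d, r))
B_anchor = rng.standard_normal((r, d))
sens_A = rng.uniform(0.0, 1.0, (d, r))
sens_B = rng.uniform(0.0, 1.0, (r, d))

# Factors that drift while training on a new task incur a positive
# penalty; staying exactly at the anchor costs nothing.
A_drift = A_anchor + 0.1 * rng.standard_normal((d, r))
penalty = lra_penalty(A_drift, B_anchor, A_anchor, B_anchor, sens_A, sens_B)
print(penalty > 0.0)  # True
```

Adding such a term to the task loss discourages exactly the parameter movements that would most increase loss on earlier tasks, which is the standard regularization-based route to mitigating catastrophic forgetting without stored exemplars.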
Problem

Research questions and friction points this paper is trying to address.

Continual Learning
Audio-Visual Segmentation
Multi-Modal Learning
Catastrophic Forgetting
Dynamic Environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continual Learning
Audio-Visual Segmentation
Exemplar-Free
Low-Rank Anchoring
Multi-Modal Learning
Siddeshwar Raghavan
Purdue University, West Lafayette, IN 47906, USA
Gautham Vinod
PhD Candidate in ECE, Purdue University
Computer Vision · Smart Health · Image Processing · Deep Learning
Bruce Coburn
Purdue University, West Lafayette, IN 47906, USA
Fengqing Zhu
Purdue University, West Lafayette, IN 47906, USA