Audio-Visual Continual Test-Time Adaptation without Forgetting

📅 2026-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation and catastrophic forgetting that audio-visual models experience when continually tested on unlabeled, non-stationary target domains under distribution shift. To tackle this challenge without access to source data, the authors propose a test-time adaptation method that fine-tunes only the modality fusion layer. Leveraging the observation that fusion-layer parameters exhibit strong cross-task transferability, the approach employs a dynamic parameter retrieval mechanism that selectively fetches the best-matching fusion parameters from a buffer using only a small batch of incoming test samples. This strategy mitigates catastrophic forgetting while improving generalization. Experiments show that the proposed method significantly outperforms existing approaches on benchmarks involving both single- and dual-modality corruptions.
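The retrieve-adapt-store loop described above can be sketched in a few lines. The snippet below is a minimal toy illustration, not the paper's implementation: it assumes a linear fusion layer, uses mean prediction entropy on a small test batch as the (hypothetical) retrieval score, and omits the adaptation step itself. Names such as `retrieve_best` and `mean_entropy` are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_entropy(W, feats):
    """Mean prediction entropy of a linear fusion layer W on a small batch."""
    p = softmax(feats @ W)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

def retrieve_best(buffer, feats):
    """Index of the buffered fusion weights that are most confident
    (lowest mean prediction entropy) on the incoming test batch."""
    scores = [mean_entropy(W, feats) for W in buffer]
    return int(np.argmin(scores))

# Toy demo: two candidate fusion layers scored on a 4-sample batch of
# fused audio-visual features (8-dim features, 3 classes).
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
sharp = np.eye(8, 3) * 5.0   # yields peaked, low-entropy predictions
flat = np.zeros((8, 3))      # yields uniform, high-entropy predictions
buffer = [flat, sharp]
best = retrieve_best(buffer, feats)
# In a full method, buffer[best] would now be adapted on this batch
# (e.g., by an entropy-minimization step) and written back to the buffer.
```

In this toy setup, `retrieve_best` selects the `sharp` weights because they produce lower-entropy predictions on the batch than the all-zero `flat` weights, which always output a uniform distribution.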

📝 Abstract
Audio-visual continual test-time adaptation continually adapts a source audio-visual model at test time to unlabeled, non-stationary domains in which either or both modalities may be distributionally shifted; such shifts hamper online cross-modal learning and eventually lead to poor accuracy. While previous works have tackled this problem, we find that SOTA methods suffer from catastrophic forgetting, where the model's performance drops well below that of the source model due to continual parameter updates at test time. In this work, we first show that adapting only the modality fusion layer to a target domain not only improves performance on that domain but can also enhance performance on subsequent domains. Based on this strong cross-task transferability of the fusion layer's parameters, we propose a method, $\texttt{AV-CTTA}$, that improves test-time performance without access to any source data. Our approach uses a selective parameter retrieval mechanism that dynamically retrieves the best fusion-layer parameters from a buffer using only a small batch of test data. These parameters are then integrated into the model, adapted to the current test distribution, and saved back for future use. Extensive experiments on benchmark datasets involving unimodal and bimodal corruptions show that $\texttt{AV-CTTA}$ significantly outperforms existing methods while minimizing catastrophic forgetting.
Problem

Research questions and friction points this paper is trying to address.

audio-visual
continual test-time adaptation
catastrophic forgetting
distribution shift
cross-modal learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

continual test-time adaptation
audio-visual learning
catastrophic forgetting
modality fusion
parameter retrieval