Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation

📅 2025-06-14
🏛️ IEEE Transactions on Circuits and Systems for Video Technology (Print)
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing video test-time adaptation (TTA) methods rely solely on visual signals, overlooking the synergistic potential of audio semantics. This paper is the first to introduce audio into video TTA, proposing an audio-augmented pseudo-labeling framework: it leverages a pretrained audio model (e.g., AST) to generate audio classification confidence scores and employs a large language model to align the audio and visual label spaces semantically. The authors further design a dynamic iterative adaptation mechanism grounded in loss minimization and multi-view consistency, enabling sample-wise adaptation with automatic termination. Evaluated on UCF101-C, Kinetics-Sounds-C, and two newly constructed benchmarks (AVE-C and AVMIT-C), the method consistently improves the TTA performance of mainstream video classifiers (e.g., ResNet, ViT) under diverse corruptions. The results demonstrate that audio signals provide critical gains for robust video understanding.

๐Ÿ“ Abstract
Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels. To effectively leverage the generated pseudo-labels, we present a flexible adaptation cycle that determines the optimal number of adaptation iterations for each sample, based on changes in loss and consistency across different views. This enables a customized adaptation process for each sample. Experimental results on two widely used datasets (UCF101-C and Kinetics-Sounds-C), as well as on two newly constructed audio-video TTA datasets (AVE-C and AVMIT-C) with various corruption types, demonstrate the superiority of our approach. Our method consistently improves adaptation performance across different video classification models and represents a significant step forward in integrating audio information into video TTA. Code: https://github.com/keikeiqi/Audio-Assisted-TTA.
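The abstract's two key ideas can be sketched in a few lines: map the top audio prediction into the video label space through a (possibly LLM-built) category mapping, and run a per-sample adaptation loop that stops once the loss stops improving. This is a minimal illustrative sketch, not the authors' implementation; all function names, the mapping dictionary, and the stopping tolerance are assumptions for illustration.

```python
def map_audio_to_video_label(audio_scores, audio_to_video):
    """Pick the most confident audio class and map it to a video label.

    audio_scores:   dict of audio_class -> confidence (e.g. from an audio
                    classifier such as AST)
    audio_to_video: dict of audio_class -> video_class (in the paper this
                    mapping is established with a large language model)
    Returns the video-space pseudo-label, or None if the class is unmapped.
    """
    top_audio = max(audio_scores, key=audio_scores.get)
    return audio_to_video.get(top_audio)


def flexible_adaptation_cycle(losses, max_iters=10, tol=1e-3):
    """Decide how many adaptation iterations to run for one sample.

    `losses` stands in for the per-iteration adaptation loss (precomputed
    here for illustration). Adaptation terminates once the improvement
    drops below `tol`, mimicking the paper's loss-change criterion.
    """
    steps = 0
    prev = float("inf")
    for loss in losses[:max_iters]:
        if prev - loss < tol:  # no meaningful improvement: stop early
            break
        prev = loss
        steps += 1
    return steps


# Toy usage: a "dog barking" audio class maps to a "walking the dog" video label.
scores = {"dog barking": 0.81, "speech": 0.12, "music": 0.07}
mapping = {"dog barking": "walking the dog", "music": "playing guitar"}
pseudo_label = map_audio_to_video_label(scores, mapping)

# The loss plateaus after three steps, so adaptation terminates early.
n_steps = flexible_adaptation_cycle([1.0, 0.6, 0.4, 0.3995])
```

In the full method the pseudo-label would supervise the video model's adaptation updates, and the stopping rule would additionally check prediction consistency across augmented views of the same sample.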
Problem

Research questions and friction points this paper is trying to address.

Incorporating audio data into video test-time adaptation
Mapping audio predictions to video labels using LLMs
Customizing adaptation iterations based on loss changes and cross-view consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-assisted pseudo-labels enhance video TTA
Audio-to-video label mapping via LLMs
Flexible adaptation cycle customizes iteration count
Runhao Zeng
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University and Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen, 518172, China
Qi Deng
School of Software Engineering, South China University of Technology, Guangzhou, 510000, China
Ronghao Zhang
Unknown affiliation
Psycholinguistics, Computational Linguistics
Shuaicheng Niu
Nanyang Technological University
Machine Learning, Domain Adaptation, Robustness, AutoML
Jian Chen
School of Software Engineering, South China University of Technology, Guangzhou, 510000, China
Xiping Hu
Professor at Beijing Institute of Technology
Cyber-Physical Systems, Crowd Computing, Affective Computing
Victor C. M. Leung
SMBU / Shenzhen University / The University of British Columbia
Communication Systems, Wireless Networks, Mobile Systems