Test-Time Adaptation for Combating Missing Modalities in Egocentric Videos

📅 2024-04-23
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address missing modalities in first-person (egocentric) videos—arising from privacy constraints, hardware limitations, or efficiency requirements—this paper proposes a test-time adaptation (TTA) framework that operates without retraining. Methodologically, it introduces the first fully online, self-supervised approach to handling arbitrary missing modalities at test time: it minimizes the mutual information between the model's prediction and the identity of the available modality, making predictions insensitive to which modality is present, and adds a self-distillation term that preserves the original dual-modal performance when both modalities are available. The framework is model-agnostic and integrates with diverse pre-trained architectures. Evaluated across multiple benchmarks, it achieves significant improvements in both action recognition and temporal action localization—particularly under audio-only or vision-only input—while incurring zero retraining overhead. It generalizes across domains and deployment scenarios, offering practical applicability for real-world egocentric video understanding systems.

📝 Abstract
Understanding videos that contain multiple modalities is crucial, especially in egocentric videos, where combining various sensory inputs significantly improves tasks like action recognition and moment localization. However, real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues. Current methods, while effective, often necessitate retraining the model entirely to handle missing modalities, making them computationally intensive, particularly with large training datasets. In this study, we propose a novel approach to address this issue at test time without requiring retraining. We frame the problem as a test-time adaptation task, where the model adjusts to the available unlabeled data at test time. Our method, MiDl (Mutual information with self-Distillation), encourages the model to be insensitive to the specific modality source present during testing by minimizing the mutual information between the prediction and the available modality. Additionally, we incorporate self-distillation to maintain the model's original performance when both modalities are available. MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time. Through experiments with various pretrained models and datasets, MiDl demonstrates substantial performance improvement without the need for retraining.
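To make the objective in the abstract concrete, here is a minimal NumPy sketch—a hypothetical illustration, not the authors' implementation. The mutual information between the prediction and the modality source is approximated as the average KL divergence between each modality-conditioned prediction and their marginal (equal to the mutual information when modalities are equiprobable), and self-distillation is a KL term pulling the adapting model toward a frozen pretrained model's output. All function and argument names here are invented for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-8):
    # Per-sample KL divergence KL(p || q) over the class axis.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def midl_losses(logits_by_modality, logits_teacher=None):
    """Sketch of a MiDl-style test-time objective (hypothetical names).

    logits_by_modality: dict mapping a modality name ("audio", "video", ...)
        to the adapting model's (batch, classes) logits for that input.
    logits_teacher: frozen pretrained model's logits on the same batch,
        used for the self-distillation term when available.
    """
    preds = {m: softmax(z) for m, z in logits_by_modality.items()}
    # Marginal prediction, averaging over the available modality sources.
    marginal = np.mean(list(preds.values()), axis=0)
    # Mutual information surrogate: average KL between each
    # modality-conditioned prediction and the marginal. It is zero iff
    # the prediction does not depend on which modality was observed.
    loss = np.mean([kl(p, marginal).mean() for p in preds.values()])
    if logits_teacher is not None:
        # Self-distillation: keep the adapted predictions close to the
        # pretrained model's output, preserving complete-modality accuracy.
        student = softmax(np.mean(list(logits_by_modality.values()), axis=0))
        loss = loss + kl(student, softmax(logits_teacher)).mean()
    return loss
```

In an online TTA loop, this unsupervised loss would be minimized with one gradient step per incoming unlabeled test batch before predicting; when predictions agree across modalities the mutual-information term vanishes, so the update leaves modality-insensitive models unchanged.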
Problem

Research questions and friction points this paper is trying to address.

Addresses missing modalities in egocentric videos without retraining.
Proposes test-time adaptation to handle incomplete sensory inputs.
Introduces MiDl for self-supervised, online modality adaptation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time adaptation for missing modalities
Mutual information minimization with self-distillation
Self-supervised online solution without retraining