Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

career value

193K/year

🤖 AI Summary

This work addresses the challenge of unsupervised domain adaptation in video analysis, where static background content often exacerbates domain shift and existing approaches suffer from poor computational efficiency. To this end, the paper introduces, for the first time, a dynamic-aware learnable motion-focused tokenization mechanism. By partitioning video frames into patches and selectively retaining tokens based on motion sensitivity, the method automatically discards redundant low-motion background tokens while preserving those rich in action-related information for domain adaptation. Evaluated across three standard benchmarks and 21 domain transfer settings, the proposed approach consistently outperforms current state-of-the-art methods, achieving higher accuracy while significantly reducing computational overhead—thus offering an effective balance between performance and practicality.

Technology Category

Application Category

📝 Abstract

Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.

Problem

Research questions and friction points this paper is trying to address.

Video Unsupervised Domain Adaptation

action recognition

domain shift

computational efficiency

background redundancy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable Motion-Focused Tokenization

Video Unsupervised Domain Adaptation

Motion-aware Token Pruning