Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation

πŸ“… 2026-04-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF

career value

202K/year
πŸ€– AI Summary
This work addresses the challenge of unsupervised domain adaptation in video analysis, where static background content often exacerbates domain shift and existing approaches suffer from poor computational efficiency. To this end, the paper introduces, for the first time, a dynamic-aware learnable motion-focused tokenization mechanism. By partitioning video frames into patches and selectively retaining tokens based on motion sensitivity, the method automatically discards redundant low-motion background tokens while preserving those rich in action-related information for domain adaptation. Evaluated across three standard benchmarks and 21 domain transfer settings, the proposed approach consistently outperforms current state-of-the-art methods, achieving higher accuracy while significantly reducing computational overheadβ€”thus offering an effective balance between performance and practicality.

Technology Category

Application Category

πŸ“ Abstract
Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.
Problem

Research questions and friction points this paper is trying to address.

Video Unsupervised Domain Adaptation
action recognition
domain shift
computational efficiency
background redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable Motion-Focused Tokenization
Video Unsupervised Domain Adaptation
Motion-aware Token Pruning
Computational Efficiency
Action Recognition
πŸ”Ž Similar Papers