🤖 AI Summary
This work addresses spatial and temporal distribution shift in unsupervised video domain adaptation by proposing MetaTrans, an approach that handles the spatial and temporal components of domain shift separately. MetaTrans introduces a temporal-static subtraction module to disentangle dynamic and static features, which are then jointly optimized through an objective with only two loss terms. The method relies on a concise, purpose-built network architecture to mitigate both types of domain shift. Experiments on multiple cross-domain action recognition tasks show that MetaTrans outperforms state-of-the-art baselines in both absolute adaptation performance and relative performance gain.
📝 Abstract
Unsupervised video domain adaptation (UVDA) is a practical but under-explored problem. In this paper, we propose a frustratingly easy UVDA method called MetaTrans. Specifically, MetaTrans adopts a concise learning objective that contains only two fundamental loss terms. Despite the simplicity of this objective, MetaTrans embodies an advanced UVDA idea: handling the spatial and temporal divergence of cross-domain videos separately, through a carefully designed model architecture. By implementing a temporal-static subtraction module, MetaTrans effectively removes both spatial and temporal divergence. Extensive empirical evaluations, particularly on various cross-domain action recognition tasks, show substantial gains in absolute adaptation performance and a significantly larger relative performance gain than state-of-the-art UVDA baselines.
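The abstract does not specify how the temporal-static subtraction module is implemented. As a rough illustration of the general idea only, the sketch below assumes that the static (spatial) part of a clip is the temporal mean of its per-frame features, and that the dynamic (temporal) part is what remains after subtracting that mean from each frame. The function name `temporal_static_split` and this mean-based definition are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Vector = List[float]


def temporal_static_split(frame_features: List[Vector]) -> Tuple[Vector, List[Vector]]:
    """Split per-frame features into a static part and dynamic residuals.

    Assumption (not from the paper): the static part is the temporal mean
    of the frame features, and the dynamic part is each frame minus it.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    # Static component: average of every feature channel over time.
    static = [sum(frame[d] for frame in frame_features) / num_frames for d in range(dim)]
    # Dynamic component: per-frame residual after removing the static part.
    dynamic = [[frame[d] - static[d] for d in range(dim)] for frame in frame_features]
    return static, dynamic


# Toy clip: 3 frames, 2 feature channels. Channel 0 changes over time,
# channel 1 is constant, so it ends up entirely in the static part.
clip = [[1.0, 4.0], [3.0, 4.0], [5.0, 4.0]]
static, dynamic = temporal_static_split(clip)
```

Under this assumption, the static vector and the dynamic residuals could each be passed to a separate alignment step, which is one way to read "handling the spatial and temporal divergence separately"; the actual losses used by MetaTrans are not described in the abstract.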