🤖 AI Summary
To address the inefficiency and performance limitations caused by spatiotemporal coupling in video tokenization, this paper proposes DeRA, a decoupled 1D video tokenizer. DeRA employs a dual-stream architecture that models spatial semantics and temporal dynamics separately within a shared 1D latent space. To mitigate the gradient conflicts arising from heterogeneous supervision, it introduces the Symmetric Alignment-Conflict Projection (SACP) module, which enables joint optimization via gradient decomposition and directional regulation. The method combines feature alignment with pretrained vision foundation models and hierarchical dual-stream encoding. Experiments show that DeRA improves rFVD by 25% over LARP on UCF-101 and attains state-of-the-art performance on both class-conditional video generation (UCF-101) and frame prediction (Kinetics-600), validating its effectiveness in learning disentangled spatiotemporal representations.
📄 Abstract
This paper presents DeRA, a novel 1D video tokenizer that decouples spatial-temporal representation learning in video tokenization to achieve better training efficiency and performance. Specifically, DeRA maintains a compact 1D latent space while factorizing video encoding into appearance and motion streams, which are aligned with pretrained vision foundation models to capture the spatial semantics and temporal dynamics of videos separately. To address the gradient conflicts introduced by heterogeneous supervision, we further propose the Symmetric Alignment-Conflict Projection (SACP) module, which proactively reformulates gradients by suppressing their components along conflicting directions. Extensive experiments demonstrate that DeRA outperforms LARP, the previous state-of-the-art video tokenizer, by 25% on UCF-101 in terms of rFVD. Moreover, using DeRA for autoregressive video generation, we achieve new state-of-the-art results on both UCF-101 class-conditional generation and K600 frame prediction.
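The abstract does not spell out SACP's equations, but the core idea it names, suppressing gradient components along conflicting directions, can be illustrated with a minimal sketch. The snippet below assumes each stream (appearance, motion) yields one gradient vector on shared parameters and symmetrically projects away the mutually conflicting components, in the spirit of PCGrad-style projection; the function name and this exact rule are illustrative, not the paper's formulation.

```python
def dot(u, v):
    """Inner product of two vectors given as Python lists."""
    return sum(a * b for a, b in zip(u, v))

def symmetric_conflict_projection(g_app, g_mot):
    """Illustrative sketch of symmetric conflicting-gradient projection.

    If the appearance and motion gradients point in conflicting
    directions (negative inner product), remove from each gradient its
    component along the other; otherwise pass both through unchanged.
    NOTE: this is an assumption-driven sketch, not DeRA's exact SACP rule.
    """
    d = dot(g_app, g_mot)
    if d < 0:  # obtuse angle => the two supervisions conflict
        g_app_new = [a - d / dot(g_mot, g_mot) * m
                     for a, m in zip(g_app, g_mot)]
        g_mot_new = [m - d / dot(g_app, g_app) * a
                     for a, m in zip(g_app, g_mot)]
        return g_app_new, g_mot_new
    return g_app, g_mot

# Conflicting pair: each projected gradient becomes orthogonal
# to the other stream's original gradient.
ga, gm = symmetric_conflict_projection([1.0, 0.0], [-1.0, 1.0])
# → ([0.5, 0.5], [0.0, 1.0])
```

After projection, `dot(ga, [-1.0, 1.0]) == 0` and `dot(gm, [1.0, 0.0]) == 0`, so neither stream's update pushes against the other's original descent direction, which is the stated goal of suppressing conflicting components.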