🤖 AI Summary
To address the inefficiency and performance limitations caused by spatiotemporal coupling in video tokenization, this paper proposes DeRA, a decoupled 1D video tokenizer. DeRA employs a dual-stream architecture that models spatial semantics and temporal dynamics separately within a shared 1D latent space. To mitigate the gradient conflicts arising from heterogeneous supervision, it introduces the Symmetric Alignment-Conflict Projection (SACP) module, which enables joint optimization via gradient decomposition and directional regulation. The method combines feature alignment with pretrained vision foundation models and hierarchical dual-stream encoding. Experiments show that DeRA improves rFVD by 25% over LARP on UCF-101 and attains state-of-the-art performance on both class-conditional video generation (UCF-101) and frame prediction (Kinetics-600), validating its effectiveness in learning disentangled spatiotemporal representations.
📄 Abstract
This paper presents DeRA, a novel 1D video tokenizer that decouples spatial-temporal representation learning in video tokenization to achieve better training efficiency and performance. Specifically, DeRA maintains a compact 1D latent space while factorizing video encoding into appearance and motion streams, which are aligned with pretrained vision foundation models to capture the spatial semantics and temporal dynamics of videos separately. To address the gradient conflicts introduced by heterogeneous supervision, we further propose the Symmetric Alignment-Conflict Projection (SACP) module, which proactively reformulates gradients by suppressing their components along conflicting directions. Extensive experiments demonstrate that DeRA outperforms LARP, the previous state-of-the-art video tokenizer, by 25% on UCF-101 in terms of rFVD. Moreover, using DeRA for autoregressive video generation, we achieve new state-of-the-art results on both UCF-101 class-conditional generation and K600 frame prediction.
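The abstract does not spell out SACP's equations, but the core idea it names, suppressing gradient components along conflicting directions, can be illustrated with a minimal sketch. The snippet below assumes each stream (appearance, motion) yields one gradient vector on shared parameters and symmetrically projects away the mutually conflicting components, in the spirit of PCGrad-style projection; the function name and this exact rule are illustrative, not the paper's formulation.

```python
def dot(u, v):
    """Inner product of two vectors given as Python lists."""
    return sum(a * b for a, b in zip(u, v))

def symmetric_conflict_projection(g_app, g_mot):
    """Illustrative sketch of symmetric conflicting-gradient projection.

    If the appearance and motion gradients point in conflicting
    directions (negative inner product), remove from each gradient its
    component along the other; otherwise pass both through unchanged.
    NOTE: this is an assumption-driven sketch, not DeRA's exact SACP rule.
    """
    d = dot(g_app, g_mot)
    if d < 0:  # obtuse angle => the two supervisions conflict
        g_app_new = [a - d / dot(g_mot, g_mot) * m
                     for a, m in zip(g_app, g_mot)]
        g_mot_new = [m - d / dot(g_app, g_app) * a
                     for a, m in zip(g_app, g_mot)]
        return g_app_new, g_mot_new
    return g_app, g_mot

# Conflicting pair: each projected gradient becomes orthogonal
# to the other stream's original gradient.
ga, gm = symmetric_conflict_projection([1.0, 0.0], [-1.0, 1.0])
# → ([0.5, 0.5], [0.0, 1.0])
```

After projection, `dot(ga, [-1.0, 1.0]) == 0` and `dot(gm, [1.0, 0.0]) == 0`, so neither stream's update pushes against the other's original descent direction, which is the stated goal of suppressing conflicting components.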