🤖 AI Summary
This paper addresses low token efficiency in video reconstruction and generation by proposing a content-aware, temporally causal adaptive video tokenization method. It dynamically allocates frame-level tokens within a 1D latent space, enabling flexible, sample-level tokenization under strict token budget constraints. Key contributions include: (1) a block-wise causal quality scorer that preserves temporal causality for robust sequence modeling; (2) a block masking training strategy coupled with random tail dropping to enhance generalization and robustness; and (3) an integer linear programming–based optimization framework that enables precise, budget-controllable token allocation. Evaluated on UCF-101 and Kinetics-600, the method significantly improves both video reconstruction fidelity and generative performance without requiring auxiliary image data, achieving efficient, budget-adaptive token modeling across varying computational constraints.
📝 Abstract
We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. This design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments on video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
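The abstract describes allocating tokens per block via integer linear programming: pick one token count per block so that the predicted quality scores are maximized while the total stays within a budget. Below is a minimal sketch of that kind of ILP, not the paper's actual implementation; the function name, the score matrix layout, and the use of SciPy's `milp` solver are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def allocate_tokens(scores, token_options, budget):
    """Budget-constrained token allocation as an ILP (illustrative sketch).

    scores:        (B, K) array; scores[i, j] is the predicted quality of
                   block i when encoded with token_options[j] tokens
                   (in AdapTok this would come from the causal scorer).
    token_options: (K,) candidate token counts per block.
    budget:        max total tokens across all B blocks.
    Returns the chosen token count for each block.
    """
    B, K = scores.shape
    # Binary decision variables x[i, j]: block i uses token_options[j] tokens.
    # milp minimizes, so negate scores to maximize total predicted quality.
    c = -scores.ravel()

    # Constraint 1: exactly one token-count option selected per block.
    select_one = np.zeros((B, B * K))
    for i in range(B):
        select_one[i, i * K:(i + 1) * K] = 1

    # Constraint 2: total allocated tokens must not exceed the budget.
    token_cost = np.tile(token_options, B)[None, :].astype(float)

    A = np.vstack([select_one, token_cost])
    lb = np.concatenate([np.ones(B), [0.0]])
    ub = np.concatenate([np.ones(B), [float(budget)]])

    res = milp(c=c,
               constraints=LinearConstraint(A, lb, ub),
               integrality=np.ones(B * K),   # all variables binary
               bounds=Bounds(0, 1))
    x = res.x.reshape(B, K)
    return token_options[x.argmax(axis=1)]
```

For example, with two blocks, options of 4 or 8 tokens each, and a budget of 12, the solver spends the extra tokens on whichever block the scorer predicts benefits more, rather than splitting evenly.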