🤖 AI Summary
This paper addresses low token efficiency in video reconstruction and generation by proposing a content-aware, temporally causal adaptive video tokenization method. It dynamically allocates frame-level tokens within a 1D latent space, enabling flexible, sample-level tokenization under strict token budget constraints. Key contributions include: (1) a block-wise causal quality scorer that preserves temporal causality for robust sequence modeling; (2) a block masking training strategy coupled with random tail dropping to enhance generalization and robustness; and (3) an integer linear programming–based optimization framework that enables precise, budget-controllable token allocation. Evaluated on UCF-101 and Kinetics-600, the method significantly improves both video reconstruction fidelity and generative performance without requiring auxiliary image data, achieving efficient, budget-adaptive token modeling across varying computational constraints.
📝 Abstract
We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. This design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments on video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
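The abstract describes allocating tokens per block via integer linear programming: pick one token count per block so that the predicted quality scores are maximized while the total stays within a budget. Below is a minimal sketch of that kind of ILP, not the paper's actual implementation; the function name, the score matrix layout, and the use of SciPy's `milp` solver are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def allocate_tokens(scores, token_options, budget):
    """Budget-constrained token allocation as an ILP (illustrative sketch).

    scores:        (B, K) array; scores[i, j] is the predicted quality of
                   block i when encoded with token_options[j] tokens
                   (in AdapTok this would come from the causal scorer).
    token_options: (K,) candidate token counts per block.
    budget:        max total tokens across all B blocks.
    Returns the chosen token count for each block.
    """
    B, K = scores.shape
    # Binary decision variables x[i, j]: block i uses token_options[j] tokens.
    # milp minimizes, so negate scores to maximize total predicted quality.
    c = -scores.ravel()

    # Constraint 1: exactly one token-count option selected per block.
    select_one = np.zeros((B, B * K))
    for i in range(B):
        select_one[i, i * K:(i + 1) * K] = 1

    # Constraint 2: total allocated tokens must not exceed the budget.
    token_cost = np.tile(token_options, B)[None, :].astype(float)

    A = np.vstack([select_one, token_cost])
    lb = np.concatenate([np.ones(B), [0.0]])
    ub = np.concatenate([np.ones(B), [float(budget)]])

    res = milp(c=c,
               constraints=LinearConstraint(A, lb, ub),
               integrality=np.ones(B * K),   # all variables binary
               bounds=Bounds(0, 1))
    x = res.x.reshape(B, K)
    return token_options[x.argmax(axis=1)]
```

For example, with two blocks, options of 4 or 8 tokens each, and a budget of 12, the solver spends the extra tokens on whichever block the scorer predicts benefits more, rather than splitting evenly.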