VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video diffusion models employ tokenizers with fixed temporal compression ratios, causing computational cost to scale linearly with frame rate. To address this, we propose the Duration-Proportional Information Assumption: the upper bound of a video's information capacity depends on its duration, not its frame count. Guided by this principle, we design VFRTok, a Transformer-based variable-frame-rate video tokenizer that achieves flexible spatiotemporal compression via asymmetric frame-rate training of its encoder and decoder. We further introduce Partial RoPE, a variant of rotary position embedding that explicitly decouples positional modeling from content modeling while grouping correlated image patches into unified tokens. Experiments demonstrate that VFRTok achieves competitive reconstruction quality and state-of-the-art generation fidelity using only 1/8 the number of tokens required by prior methods, significantly reducing the computational overhead of diffusion-based video generation.

📝 Abstract
Modern video generation frameworks based on Latent Diffusion Models suffer from inefficient tokenization due to the Frame-Proportional Information Assumption. Existing tokenizers provide fixed temporal compression rates, causing the computational cost of the diffusion model to scale linearly with the frame rate. The paper proposes the Duration-Proportional Information Assumption: the upper bound on the information capacity of a video is proportional to its duration rather than its number of frames. Based on this insight, the paper introduces VFRTok, a Transformer-based video tokenizer that enables variable-frame-rate encoding and decoding through asymmetric frame-rate training between the encoder and decoder. Furthermore, the paper proposes Partial Rotary Position Embedding (RoPE) to decouple position and content modeling, grouping correlated patches into unified tokens. Partial RoPE improves content-awareness, enhancing video generation capability. Benefiting from the compact and continuous spatio-temporal representation, VFRTok achieves competitive reconstruction quality and state-of-the-art generation fidelity while using only 1/8 the tokens of existing tokenizers.
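The abstract describes Partial RoPE as applying rotary position embedding to only part of each token's channels, so the remaining channels carry content free of positional rotation. The paper's exact channel split and implementation are not given here; the sketch below is a minimal NumPy illustration of the general idea, with `rot_frac` (the fraction of rotated channels) as a hypothetical parameter.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Standard RoPE: rotate channel pairs by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)      # (half,)
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def partial_rope(x, positions, rot_frac=0.5):
    """Apply RoPE only to the first rot_frac of channels; the rest
    (content channels) receive no positional rotation."""
    d_rot = int(x.shape[-1] * rot_frac)
    d_rot -= d_rot % 2  # rotated channels must pair up
    rotated = rope_rotate(x[..., :d_rot], positions)
    return np.concatenate([rotated, x[..., d_rot:]], axis=-1)

# toy usage: 4 tokens with 8 channels each
x = np.random.randn(4, 8)
pos = np.arange(4, dtype=float)
y = partial_rope(x, pos, rot_frac=0.5)
assert y.shape == x.shape
assert np.allclose(y[..., 4:], x[..., 4:])  # content channels untouched
```

The position-free channels let attention match tokens by content alone, which is consistent with the claimed improvement in content-awareness.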
Problem

Research questions and friction points this paper is trying to address.

Inefficient video tokenization in Latent Diffusion Models
Fixed temporal compression rates increase computational costs
Need for variable frame rate encoding and decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variable frame rate video tokenizer VFRTok
Asymmetric frame rate training between encoder and decoder
Partial Rotary Position Embeddings for content-awareness
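The contrast between the two assumptions can be made concrete with a token-budget calculation. Under a fixed temporal compression ratio the token count grows with frame rate, while under the Duration-Proportional Information Assumption it depends only on clip length. The function names and the sample numbers below are illustrative, not the paper's configuration.

```python
def num_tokens_fixed(duration_s, fps, tokens_per_frame, temporal_stride):
    """Fixed-ratio tokenizer: token count scales with frame count (and fps)."""
    frames = int(duration_s * fps)
    return (frames // temporal_stride) * tokens_per_frame

def num_tokens_duration(duration_s, tokens_per_second):
    """Duration-proportional tokenizer: token count depends only on duration."""
    return int(duration_s * tokens_per_second)

# A 4-second clip: doubling fps doubles the fixed-ratio budget...
print(num_tokens_fixed(4.0, 24, 256, 4))   # 6144
print(num_tokens_fixed(4.0, 48, 256, 4))   # 12288
# ...but leaves the duration-proportional budget unchanged.
print(num_tokens_duration(4.0, 768))       # 3072
```

This is why a duration-proportional tokenizer can decouple diffusion cost from frame rate: the downstream model always sees the same number of tokens per second of video.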