Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

📅 2025-01-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video tokenizers struggle to push temporal compression ratios beyond 4× without increasing channel dimensions while maintaining high reconstruction fidelity. Method: This paper introduces a progressive growing architecture and a bootstrapped training paradigm: (i) reusing intermediate representations from low-compression models to guide the training of high-compression modules; (ii) a cross-level feature-mixing module for efficient knowledge transfer; and (iii) temporal-subsampling-based reconstruction supervision combined with phased freezing and fine-tuning. Results: On standard video benchmarks, the approach significantly improves reconstruction quality at temporal compression ratios beyond 4×. For downstream generative models, it achieves better synthesis quality at the same token budget, or equivalent quality with fewer tokens, enabling more efficient training and storage.
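A minimal PyTorch sketch of the bootstrapping idea summarized above: a new high-compression block is trained on top of a frozen low-compression encoder, with a cross-level feature-mixing module fusing the two streams. All module names, the 1×1×1 mixing convolution, and the stride-2 block are hypothetical stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CrossLevelFeatureMixing(nn.Module):
    """Mixes features from the frozen low-compression level with features
    from a newly added high-compression block."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1x1 conv over the concatenated features; a simple stand-in for
        # whatever mixing operator the paper actually uses.
        self.mix = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        return self.mix(torch.cat([low_feat, high_feat], dim=1))


class ProgressiveTemporalEncoder(nn.Module):
    """Bootstrapped encoder: a pretrained low-compression stage is frozen and
    a new block adds an extra 2x of temporal compression on top of it."""
    def __init__(self, base_encoder: nn.Module, channels: int):
        super().__init__()
        self.base = base_encoder
        for p in self.base.parameters():      # phase 1: keep the base frozen
            p.requires_grad = False
        # new high-compression block: temporal stride 2, spatial stride 1
        self.high_block = nn.Conv3d(channels, channels, kernel_size=3,
                                    stride=(2, 1, 1), padding=1)
        self.mixer = CrossLevelFeatureMixing(channels)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, C, T, H, W)
        low_feat = self.base(video)             # latents of the low-compression model
        high_feat = self.high_block(low_feat)   # extra 2x temporal downsampling
        # temporally subsample the low-compression features so both streams
        # align before mixing
        return self.mixer(low_feat[:, :, ::2], high_feat)
```

In a phased schedule, only `high_block` and the mixer would train at first; the base encoder could later be unfrozen for joint fine-tuning.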

📝 Abstract
Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation on video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to direct extensions of existing video tokenizers. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a reduced token budget.
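The abstract's motivating observation can be pictured with a small comparison: reconstruct a 2x temporally subsampled clip with a lower-compression tokenizer versus the original clip with a higher-compression tokenizer, matching the token budget, and compare PSNR on the shared frames. The sketch below assumes values in [0, 1] and hypothetical pretrained tokenizers exposing `encode`/`decode`; it illustrates the measurement, not the paper's evaluation code.

```python
import torch

def psnr(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # assumes inputs scaled to [0, 1]
    mse = torch.mean((x - y) ** 2)
    return 10 * torch.log10(1.0 / (mse + 1e-8))

@torch.no_grad()
def compare(video: torch.Tensor, low_tokenizer, high_tokenizer) -> dict:
    # video: (B, C, T, H, W)
    subsampled = video[:, :, ::2]              # 2x temporal subsampling

    # Path A: low-compression tokenizer on the subsampled clip.
    recon_low = low_tokenizer.decode(low_tokenizer.encode(subsampled))

    # Path B: high-compression tokenizer on the full clip; subsample the
    # reconstruction so both paths are scored on the same frames.
    recon_high = high_tokenizer.decode(high_tokenizer.encode(video))[:, :, ::2]

    return {
        "psnr_low_on_subsampled": psnr(recon_low, subsampled).item(),
        "psnr_high_on_full": psnr(recon_high, subsampled).item(),
    }
```

The paper's claim is that Path A tends to score higher, which is what motivates reusing the low-compression model's representations when growing to higher compression.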
Problem

Research questions and friction points this paper is trying to address.

Video Compression
Data Processing
Quality Retention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Enhancement Compression
Video Quality
Efficient Model Training
Authors
Aniruddha Mahapatra, Adobe Research (Computer Vision, Deep Learning)
Long Mai, Adobe Research
Yitian Zhang, Northeastern University (Computer Vision)
David Bourgin, Adobe Research
Feng Liu, Adobe Research