Rethinking Video Tokenization: A Conditioned Diffusion-based Approach

📅 2025-03-05

📈 Citations: 0

✨ Influential: 0

career value

173K/year

🤖 AI Summary

Existing video tokenization methods predominantly rely on deterministic VAE decoders, limiting both reconstruction fidelity and generative flexibility. This paper introduces the first video tokenizer based on a conditional 3D causal diffusion model, abandoning conventional VAE decoder architectures and instead employing diffusion-based reconstruction in latent space for arbitrary-length, high-fidelity video synthesis. Key contributions include: (i) the first integration of diffusion generation into the tokenizer’s decoding stage, enabling high-quality single-step sampling; (ii) a unified latent-space encoder-diffusion joint training framework; and (iii) feature caching and accelerated sampling strategies to balance efficiency and fidelity. Experiments demonstrate that our tokenizer achieves state-of-the-art performance on video reconstruction—outperforming mainstream VAEs even with single-step sampling. A lightweight variant matches the performance of the top-two baselines. Furthermore, downstream latent video generation models built upon our tokenizer exhibit显著 improvements in generation quality.

Technology Category

Application Category

📝 Abstract

Video tokenizers, which transform videos into compact latent representations, are key to video generation. Existing video tokenizers are based on the VAE architecture and follow a paradigm where an encoder compresses videos into compact latents, and a deterministic decoder reconstructs the original videos from these latents. In this paper, we propose a novel underline{ extbf{C}}onditioned underline{ extbf{D}}iffusion-based video underline{ extbf{T}}okenizer entitled extbf{ourmethod}, which departs from previous methods by replacing the deterministic decoder with a 3D causal diffusion model. The reverse diffusion generative process of the decoder is conditioned on the latent representations derived via the encoder. With a feature caching and sampling acceleration, the framework efficiently reconstructs high-fidelity videos of arbitrary lengths. Results show that {ourmethod} achieves state-of-the-art performance in video reconstruction tasks using just a single-step sampling. Even a smaller version of {ourmethod} still achieves reconstruction results on par with the top two baselines. Furthermore, the latent video generation model trained using {ourmethod} also shows superior performance.

Problem

Research questions and friction points this paper is trying to address.

Improves video tokenization using conditioned diffusion models

Replaces deterministic decoders with 3D causal diffusion models

Achieves high-fidelity video reconstruction with single-step sampling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditioned diffusion model replaces deterministic decoder

3D causal diffusion for video tokenization

Feature caching and sampling acceleration enhance efficiency

🔎 Similar Papers

TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval

2024-09-02arXiv.orgCitations: 3

VideoPrism: A Foundational Visual Encoder for Video Understanding

2024-02-20International Conference on Machine LearningCitations: 30

Netflix

The overall market range for Netflix Internships is typically $40/hour - $110/hour.

Los Gatos, CA, USA / Los Angeles, CA, USA

AI Research Scientist, Computer Vision - Facebook Video Intelligence