🤖 AI Summary
Existing video tokenization methods predominantly rely on deterministic VAE decoders, limiting both reconstruction fidelity and generative flexibility. This paper introduces the first video tokenizer based on a conditional 3D causal diffusion model, abandoning conventional VAE decoder architectures and instead employing diffusion-based reconstruction in latent space for arbitrary-length, high-fidelity video synthesis. Key contributions include: (i) the first integration of diffusion generation into the tokenizer’s decoding stage, enabling high-quality single-step sampling; (ii) a unified latent-space encoder-diffusion joint training framework; and (iii) feature caching and accelerated sampling strategies to balance efficiency and fidelity. Experiments demonstrate that our tokenizer achieves state-of-the-art performance on video reconstruction—outperforming mainstream VAEs even with single-step sampling. A lightweight variant matches the performance of the top-two baselines. Furthermore, downstream latent video generation models built upon our tokenizer exhibit显著 improvements in generation quality.
📝 Abstract
Video tokenizers, which transform videos into compact latent representations, are key to video generation. Existing video tokenizers are based on the VAE architecture and follow a paradigm where an encoder compresses videos into compact latents, and a deterministic decoder reconstructs the original videos from these latents. In this paper, we propose a novel underline{ extbf{C}}onditioned underline{ extbf{D}}iffusion-based video underline{ extbf{T}}okenizer entitled extbf{ourmethod}, which departs from previous methods by replacing the deterministic decoder with a 3D causal diffusion model. The reverse diffusion generative process of the decoder is conditioned on the latent representations derived via the encoder. With a feature caching and sampling acceleration, the framework efficiently reconstructs high-fidelity videos of arbitrary lengths. Results show that {ourmethod} achieves state-of-the-art performance in video reconstruction tasks using just a single-step sampling. Even a smaller version of {ourmethod} still achieves reconstruction results on par with the top two baselines. Furthermore, the latent video generation model trained using {ourmethod} also shows superior performance.