🤖 AI Summary
To address the inefficiency and weak representational capacity of discrete tokenization for high-dimensional video data, this paper proposes Mamba-VQ: an encoder-decoder architecture leveraging the state-space model (Mamba) coupled with a Channel-Split Quantization (CSQ) mechanism. CSQ enhances latent-space capacity without increasing the token count by quantizing grouped channels independently. The Mamba-based encoder effectively captures long-range spatiotemporal dependencies, overcoming the sequence-modeling limitations of Transformers and causal 3D convolutions. Experiments demonstrate that Mamba-VQ achieves state-of-the-art (SOTA) performance on standard video tokenization benchmarks, including Kinetics and UCF101, and exhibits superior fidelity and generalization robustness in autoregressive video generation tasks.
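The core idea of channel-split quantization, as described above, is to divide the latent channels into groups and quantize each group against its own codebook, so the effective codebook size grows combinatorially while the spatial/temporal token grid stays fixed. The sketch below is a minimal NumPy illustration of this group-wise nearest-neighbor lookup; the function name, group layout, and per-group codebooks are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def channel_split_quantize(z, codebooks):
    """Illustrative sketch of channel-split quantization (not the paper's code).

    z:          latents of shape (N, C), one row per token position.
    codebooks:  list of G arrays, each of shape (K, C // G), one per channel group.

    Each position still maps to one slot in the token grid, but its C channels
    are split into G groups that are quantized independently, giving an
    effective vocabulary of K**G combinations per position.
    """
    G = len(codebooks)
    groups = np.split(z, G, axis=-1)  # G chunks of shape (N, C // G)
    quantized, indices = [], []
    for part, cb in zip(groups, codebooks):
        # Squared Euclidean distance from each latent chunk to every codeword.
        d = ((part[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        idx = d.argmin(axis=1)
        quantized.append(cb[idx])   # replace each chunk with its nearest codeword
        indices.append(idx)
    # Reassemble the channel groups; indices has one code id per group per position.
    return np.concatenate(quantized, axis=-1), np.stack(indices, axis=-1)
```

With G = 1 this reduces to ordinary vector quantization; larger G trades a slightly larger index payload per position for exponentially more representable latent combinations.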
📝 Abstract
Discrete video tokenization is essential for efficient autoregressive generative modeling due to the high dimensionality of video data. This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequence-based tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents while preserving the token count. Our model sets a new state-of-the-art, outperforming both causal 3D convolution-based and Transformer-based approaches across multiple datasets. Experimental results further demonstrate its robustness as a tokenizer for autoregressive video generation.