MambaVideo for Discrete Video Tokenization with Channel-Split Quantization

📅 2025-07-06
🤖 AI Summary
To address the inefficiency and limited representational capacity of discrete tokenization for high-dimensional video data, this paper proposes MambaVideo: an encoder-decoder tokenizer built on the Mamba state-space model and paired with a Channel-Split Quantization (CSQ) mechanism. CSQ quantizes grouped channels independently, which increases latent-space capacity without increasing the token count. The Mamba-based encoder captures long-range spatiotemporal dependencies and is designed to overcome the sequence-modeling limitations of Transformers and causal 3D convolutions. Experiments report state-of-the-art (SOTA) performance on standard video tokenization benchmarks, including Kinetics and UCF101, along with higher fidelity and better generalization robustness when the tokenizer is used for autoregressive video generation.

📝 Abstract
Discrete video tokenization is essential for efficient autoregressive generative modeling due to the high dimensionality of video data. This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequence-based tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents while preserving the token count. Our model sets a new state-of-the-art, outperforming both causal 3D convolution-based and Transformer-based approaches across multiple datasets. Experimental results further demonstrate its robustness as a tokenizer for autoregressive video generation.
Problem

Research questions and friction points this paper is trying to address.

Overcoming limitations of sequence-based video tokenizers
Enhancing representational power of quantized latents
Improving autoregressive video generation efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba-based encoder-decoder architecture
Channel-split quantization scheme
State-of-the-art discrete video tokenizer
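
The idea behind channel-split quantization, as the summary describes it, is to quantize groups of channels independently so that each token can express more distinct values while the number of tokens stays fixed. The sketch below is illustrative only and is not the authors' implementation. The summary does not say which quantizer is applied within each group or whether groups share a codebook, so nearest-neighbor lookup with one codebook per group is an assumption. The names `nearest_code` and `channel_split_quantize` and the sizes `C`, `G`, and `K` are hypothetical.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def channel_split_quantize(latent, codebooks):
    """Quantize one token's latent vector group by group.

    latent:    list of C floats (the channels of a single token)
    codebooks: list of G codebooks, each a list of K vectors of length C // G
               (one codebook per group is an assumption of this sketch)
    Returns (indices, quantized): G code indices and the C-dim quantized vector.
    """
    num_groups = len(codebooks)
    if len(latent) % num_groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    group_dim = len(latent) // num_groups

    indices, quantized = [], []
    for g, codebook in enumerate(codebooks):
        # Each channel group is quantized independently of the others.
        chunk = latent[g * group_dim:(g + 1) * group_dim]
        idx = nearest_code(chunk, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


# Toy sizes (assumed): C = 8 channels, G = 4 groups, K = 16 codes per group.
rng = random.Random(0)
C, G, K = 8, 4, 16
codebooks = [
    [[rng.gauss(0, 1) for _ in range(C // G)] for _ in range(K)]
    for _ in range(G)
]
latent = [rng.gauss(0, 1) for _ in range(C)]

indices, quantized = channel_split_quantize(latent, codebooks)
print(indices)   # G indices that together describe a single token
print(K ** G)    # distinct quantized vectors one token can take: 65536
```

Under these assumptions, a single codebook of 16 entries would give each token 16 possible values, while four independent groups of 16 give 16^4 = 65,536 combinations for the same token count. How the per-group indices are combined for the downstream autoregressive model is not described in this summary.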