🤖 AI Summary
To address the inefficiency and weak representational capacity of discrete tokenization for high-dimensional video data, this paper proposes Mamba-VQ: an encoder-decoder architecture leveraging the state-space model (Mamba) coupled with a Channel-Split Quantization (CSQ) mechanism. CSQ enhances latent-space capacity without increasing the token count by quantizing grouped channels independently. The Mamba-based encoder effectively captures long-range spatiotemporal dependencies, overcoming the sequence-modeling limitations of Transformers and causal 3D convolutions. Experiments demonstrate that Mamba-VQ achieves state-of-the-art (SOTA) performance on standard video tokenization benchmarks, including Kinetics and UCF101, and exhibits superior fidelity and generalization robustness in autoregressive video generation tasks.
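The core idea of channel-split quantization, as described above, is to divide the latent channels into groups and quantize each group against its own codebook, so the effective codebook size grows combinatorially while the spatial/temporal token grid stays fixed. The sketch below is a minimal NumPy illustration of this group-wise nearest-neighbor lookup; the function name, group layout, and per-group codebooks are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def channel_split_quantize(z, codebooks):
    """Illustrative sketch of channel-split quantization (not the paper's code).

    z:          latents of shape (N, C), one row per token position.
    codebooks:  list of G arrays, each of shape (K, C // G), one per channel group.

    Each position still maps to one slot in the token grid, but its C channels
    are split into G groups that are quantized independently, giving an
    effective vocabulary of K**G combinations per position.
    """
    G = len(codebooks)
    groups = np.split(z, G, axis=-1)  # G chunks of shape (N, C // G)
    quantized, indices = [], []
    for part, cb in zip(groups, codebooks):
        # Squared Euclidean distance from each latent chunk to every codeword.
        d = ((part[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        idx = d.argmin(axis=1)
        quantized.append(cb[idx])   # replace each chunk with its nearest codeword
        indices.append(idx)
    # Reassemble the channel groups; indices has one code id per group per position.
    return np.concatenate(quantized, axis=-1), np.stack(indices, axis=-1)
```

With G = 1 this reduces to ordinary vector quantization; larger G trades a slightly larger index payload per position for exponentially more representable latent combinations.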
📝 Abstract
Discrete video tokenization is essential for efficient autoregressive generative modeling due to the high dimensionality of video data. This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequence-based tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents while preserving the token count. Our model sets a new state-of-the-art, outperforming both causal 3D convolution-based and Transformer-based approaches across multiple datasets. Experimental results further demonstrate its robustness as a tokenizer for autoregressive video generation.