InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the redundancy or information loss caused by fixed compression ratios in discrete tokenization of long videos, this work pioneers the integration of Shannon information theory into video tokenizer design. It rigorously proves the suboptimality of data-agnostic training and proposes an adaptive video tokenization framework based on Evidence Lower Bound (ELBO) optimization. The framework unifies variational inference, information-theoretic modeling, and Transformer architectures to enable content-aware, dynamic bit-rate allocation. Experiments demonstrate that, without performance degradation, the method reduces token count by 20% compared to fixed-rate baselines and achieves a 2.3× average compression ratio, substantially outperforming existing heuristic adaptive approaches. The core contribution is a theory-driven, approximately optimal adaptive compression mechanism grounded in information-theoretic principles.
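The content-aware, dynamic bit-rate allocation described above can be sketched as giving each frame a token budget proportional to its estimated Shannon entropy. The function below is a minimal illustration under assumed inputs, not the paper's implementation: `base_tokens`, `min_tokens`, and the per-frame codebook distributions are hypothetical.

```python
import math

def token_budget(frame_probs, base_tokens=256, min_tokens=32):
    """Allocate per-frame token budgets in proportion to Shannon entropy.

    frame_probs: one probability distribution per frame (e.g. over codebook
    entries). base_tokens and min_tokens are illustrative hyperparameters,
    not values from the paper.
    """
    entropies = []
    for probs in frame_probs:
        # Shannon entropy in bits; skip zero-probability entries.
        h = -sum(p * math.log2(p) for p in probs if p > 0.0)
        entropies.append(h)
    max_h = max(entropies) or 1.0
    # More informative frames receive a larger share of the budget,
    # clamped below by min_tokens.
    return [max(min_tokens, round(base_tokens * h / max_h)) for h in entropies]
```

A near-uniform (high-entropy) frame keeps the full budget, while a highly predictable frame is clamped to the minimum, which is the qualitative behavior the summary describes.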

📝 Abstract
Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables content-aware tokenization. Empirical results demonstrate state-of-the-art compression performance: the method saves 20% of tokens with no loss in performance and achieves 2.3x compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compact yet accurate tokenization for video representation, offering valuable insights for future research.
Problem

Research questions and friction points this paper is trying to address.

Adaptive video tokenization addresses variable information density in sequences.
Existing tokenizers introduce redundancy or information loss through fixed-rate compression.
InfoTok optimizes token allocation based on informational richness for efficiency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive video tokenization using information-theoretic compression
ELBO-based algorithm approaching theoretical optimality in tokenization
Transformer-based compressor saving tokens while maintaining performance
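The ELBO-based objective named in the bullets above can be roughly illustrated with a generic VAE-style negative ELBO: a reconstruction term plus a KL divergence against a standard-normal prior. This is a sketch under assumptions, not the paper's exact objective; the Gaussian posterior form and the `beta` weight are hypothetical.

```python
import math

def negative_elbo(x, x_recon, mu, logvar, beta=1.0):
    """Negative ELBO = reconstruction error + beta * KL(q(z|x) || N(0, I)).

    Pure-Python sketch over flat lists of floats. mu and logvar parameterize
    a diagonal-Gaussian posterior; this generic VAE form is an assumption,
    not InfoTok's published loss.
    """
    # Squared-error reconstruction term.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    # Closed-form KL between a diagonal Gaussian and the standard normal.
    kl = -0.5 * sum(1 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mu, logvar))
    return recon + beta * kl
```

With a perfect reconstruction and a posterior equal to the prior (mu = 0, logvar = 0), the loss is zero; any deviation in either term raises it, which is what the optimization trades off when allocating representation length.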