🤖 AI Summary
Existing video tokenization methods struggle to simultaneously achieve high compression ratios and high-fidelity reconstruction. This paper proposes SweetTok, a semantic-aware spatiotemporal divide-and-conquer video tokenizer built upon the VQ-VAE framework for efficient discrete representation. Its key contributions are: (1) the first LLM-embedding-driven semantic codebook, significantly enhancing codebook expressivity; (2) a curriculum learning strategy to improve the training stability of discrete representations; and (3) decoupled spatiotemporal modeling via learnable query tokens and a Cross-attention Query Autoencoder (CQAE). Experiments demonstrate that SweetTok matches state-of-the-art reconstruction quality using only 25% of the tokens and improves generation gFVD by 32.9%; at the same token count, it improves UCF-101 rFVD by 57.1% and ImageNet-1K rFID by 37.2%. Moreover, it enables LLM-powered few-shot recognition.
📝 Abstract
This paper presents the **S**emantic-a**W**ar**E** spatial-t**E**mporal **T**okenizer (SweetTokenizer), a compact yet effective discretization approach for vision data. Our goal is to boost tokenizers' compression ratio while maintaining reconstruction fidelity in the VQ-VAE paradigm. First, to obtain compact latent representations, we decouple images or videos along the spatial and temporal dimensions, translating visual information into learnable spatial and temporal query tokens through a **C**ross-attention **Q**uery **A**uto**E**ncoder (CQAE). Second, to complement the visual information during compression, we quantize these tokens via a specialized codebook derived from off-the-shelf LLM embeddings, leveraging the rich semantics of the language modality. Finally, to enhance training stability and convergence, we introduce a curriculum learning strategy, which proves critical for effective discrete visual representation learning. SweetTokenizer achieves comparable video reconstruction fidelity with only **25%** of the tokens used in previous state-of-the-art video tokenizers, and boosts video generation results by **32.9%** w.r.t. gFVD. When using the same number of tokens, it significantly improves video and image reconstruction results, by **57.1%** w.r.t. rFVD on UCF-101 and **37.2%** w.r.t. rFID on ImageNet-1K. Additionally, the compressed tokens are imbued with semantic information, enabling few-shot recognition capabilities powered by LLMs in downstream applications.
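The two core mechanisms described above can be sketched in a few lines: learnable query tokens compress patch features via cross-attention, and the resulting latents are quantized by nearest-neighbor lookup in a fixed codebook. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation; the random `features`, `queries`, and `codebook` arrays stand in for encoder patch features, CQAE query tokens, and frozen LLM word embeddings, respectively.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features, d):
    # queries: (Q, d) learnable tokens; features: (N, d) visual patch features.
    # Each query token aggregates visual information via attention weights.
    scores = queries @ features.T / np.sqrt(d)  # (Q, N)
    attn = softmax(scores, axis=-1)
    return attn @ features                      # (Q, d) compressed latents

def quantize(latents, codebook):
    # Replace each latent with its nearest codebook entry (VQ lookup).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (Q, K)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

d, N, Q, K = 32, 64, 8, 256
features = rng.normal(size=(N, d))  # stand-in for encoder patch features
queries = rng.normal(size=(Q, d))   # stand-in for spatial/temporal query tokens
codebook = rng.normal(size=(K, d))  # stand-in for frozen LLM embedding codebook

latents = cross_attention(queries, features, d)
quantized, codes = quantize(latents, codebook)
print(codes.shape, quantized.shape)  # (8,) (8, 32)
```

Note the compression: N = 64 patch features are reduced to Q = 8 discrete codes, and because each code indexes a (hypothetically LLM-derived) embedding, the discrete sequence carries semantic structure rather than arbitrary cluster IDs.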