SweetTokenizer: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization

📅 2024-12-11
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
Existing video tokenization methods struggle to simultaneously achieve high compression ratios and high-fidelity reconstruction. This paper proposes SweetTok, a semantic-aware spatiotemporal divide-and-conquer video tokenizer built upon the VQ-VAE framework for efficient discrete representation. Its key contributions are: (1) the first LLM-embedding-driven semantic codebook, significantly enhancing codebook expressivity; (2) a curriculum learning strategy to improve training stability of discrete representations; and (3) decoupled spatiotemporal modeling via learnable query tokens and a Cross-attention Query Autoencoder (CQAE). Experiments demonstrate that SweetTok achieves state-of-the-art reconstruction quality using only 25% of the tokens: gFVD improves by 32.9%, UCF-101 rFVD by 57.1%, and ImageNet-1K rFID by 37.2%. Moreover, it enables LLM-powered few-shot recognition.

Technology Category

Application Category

📝 Abstract
This paper presents the extbf{S}emantic-a extbf{W}ar extbf{E} spatial-t extbf{E}mporal extbf{T}okenizer (SweetTokenizer), a compact yet effective discretization approach for vision data. Our goal is to boost tokenizers' compression ratio while maintaining reconstruction fidelity in the VQ-VAE paradigm. Firstly, to obtain compact latent representations, we decouple images or videos into spatial-temporal dimensions, translating visual information into learnable querying spatial and temporal tokens through a extbf{C}ross-attention extbf{Q}uery extbf{A}uto extbf{E}ncoder (CQAE). Secondly, to complement visual information during compression, we quantize these tokens via a specialized codebook derived from off-the-shelf LLM embeddings to leverage the rich semantics from language modality. Finally, to enhance training stability and convergence, we also introduce a curriculum learning strategy, which proves critical for effective discrete visual representation learning. SweetTokenizer achieves comparable video reconstruction fidelity with only extbf{25%} of the tokens used in previous state-of-the-art video tokenizers, and boost video generation results by extbf{32.9%} w.r.t gFVD. When using the same token number, we significantly improves video and image reconstruction results by extbf{57.1%} w.r.t rFVD on UCF-101 and extbf{37.2%} w.r.t rFID on ImageNet-1K. Additionally, the compressed tokens are imbued with semantic information, enabling few-shot recognition capabilities powered by LLMs in downstream applications.
Problem

Research questions and friction points this paper is trying to address.

Overcomes limitations in video tokenization methods
Compresses video tokens while maintaining high fidelity
Enhances semantic representation of appearance and motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Query Autoencoder for video compression
Motion-enhanced Language Codebook for semantic compression
Semantic-aware tokens enable few-shot recognition
🔎 Similar Papers
No similar papers found.
Z
Zhentao Tan
Kuaishou Technology, Beijing, China
B
Ben Xue
Kuaishou Technology, Beijing, China
Jian Jia
Jian Jia
Institute of Automation, Chinese Academy of Sciences (CASIA)
computer vision
J
Junhao Wang
Kuaishou Technology, Beijing, China
W
Wencai Ye
Kuaishou Technology, Beijing, China
Shaoyun Shi
Shaoyun Shi
Tsinghua University
RecommendationDeep Learning
Mingjie Sun
Mingjie Sun
Thinking Machines Lab
W
Wenjin Wu
Kuaishou Technology, Beijing, China
Q
Quan Chen
Kuaishou Technology, Beijing, China
P
Peng Jiang
Kuaishou Technology, Beijing, China