🤖 AI Summary
Existing video tokenization methods struggle to simultaneously achieve high compression ratios and high-fidelity reconstruction. This paper proposes SweetTok, a semantic-aware spatiotemporal divide-and-conquer video tokenizer built upon the VQ-VAE framework for efficient discrete representation. Its key contributions are: (1) the first LLM-embedding-driven semantic codebook, significantly enhancing codebook expressivity; (2) a curriculum learning strategy to improve the training stability of discrete representations; and (3) decoupled spatiotemporal modeling via learnable query tokens and a Cross-attention Query Autoencoder (CQAE). Experiments demonstrate that SweetTok matches state-of-the-art reconstruction quality using only 25% of the tokens and improves generation gFVD by 32.9%; at the same token count, it improves UCF-101 rFVD by 57.1% and ImageNet-1K rFID by 37.2%. Moreover, it enables LLM-powered few-shot recognition.
📝 Abstract
This paper presents the **S**emantic-a**W**ar**E** spatial-t**E**mporal **T**okenizer (SweetTokenizer), a compact yet effective discretization approach for vision data. Our goal is to boost tokenizers' compression ratio while maintaining reconstruction fidelity in the VQ-VAE paradigm. First, to obtain compact latent representations, we decouple images or videos along the spatial and temporal dimensions, translating visual information into learnable spatial and temporal query tokens through a **C**ross-attention **Q**uery **A**uto**E**ncoder (CQAE). Second, to complement the visual information during compression, we quantize these tokens via a specialized codebook derived from off-the-shelf LLM embeddings, leveraging the rich semantics of the language modality. Finally, to enhance training stability and convergence, we introduce a curriculum learning strategy, which proves critical for effective discrete visual representation learning. SweetTokenizer achieves comparable video reconstruction fidelity with only **25%** of the tokens used in previous state-of-the-art video tokenizers, and boosts video generation results by **32.9%** w.r.t. gFVD. When using the same number of tokens, it significantly improves video and image reconstruction results, by **57.1%** w.r.t. rFVD on UCF-101 and **37.2%** w.r.t. rFID on ImageNet-1K. Additionally, the compressed tokens are imbued with semantic information, enabling few-shot recognition capabilities powered by LLMs in downstream applications.
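The two core mechanisms described above can be sketched in a few lines: learnable query tokens compress patch features via cross-attention, and the resulting latents are quantized by nearest-neighbor lookup in a fixed codebook. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation; the random `features`, `queries`, and `codebook` arrays stand in for encoder patch features, CQAE query tokens, and frozen LLM word embeddings, respectively.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features, d):
    # queries: (Q, d) learnable tokens; features: (N, d) visual patch features.
    # Each query token aggregates visual information via attention weights.
    scores = queries @ features.T / np.sqrt(d)  # (Q, N)
    attn = softmax(scores, axis=-1)
    return attn @ features                      # (Q, d) compressed latents

def quantize(latents, codebook):
    # Replace each latent with its nearest codebook entry (VQ lookup).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (Q, K)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

d, N, Q, K = 32, 64, 8, 256
features = rng.normal(size=(N, d))  # stand-in for encoder patch features
queries = rng.normal(size=(Q, d))   # stand-in for spatial/temporal query tokens
codebook = rng.normal(size=(K, d))  # stand-in for frozen LLM embedding codebook

latents = cross_attention(queries, features, d)
quantized, codes = quantize(latents, codebook)
print(codes.shape, quantized.shape)  # (8,) (8, 32)
```

Note the compression: N = 64 patch features are reduced to Q = 8 discrete codes, and because each code indexes a (hypothetically LLM-derived) embedding, the discrete sequence carries semantic structure rather than arbitrary cluster IDs.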