🤖 AI Summary
This work proposes VTok, a unified video tokenization framework that decouples spatial and temporal information to overcome the limitations of conventional frame-sampling approaches, which often yield redundant representations and struggle to support both generation and understanding tasks effectively. VTok extracts spatial features from a single keyframe and encodes each subsequent frame as a residual token, substantially compressing sequence length while preserving essential motion dynamics. Evaluated against naive-tokenization baselines, VTok achieves 3.4% higher accuracy on the TV-Align benchmark and a 1.9% higher VBench score, producing more coherent motion in generated videos and stronger alignment with textual guidance. The method thus provides an efficient and effective foundation for dual-task scenarios encompassing both video understanding and generation.
📝 Abstract
This work presents VTok, a unified video tokenization framework that serves both generation and understanding tasks. Unlike leading vision-language systems, which tokenize videos through a naive frame-sampling strategy, we decouple the spatial and temporal representations of videos by retaining the spatial features of a single keyframe while encoding each subsequent frame into a single residual token, achieving compact yet expressive video tokenization. This design reduces the complexity of video representation from the product of frame count and per-frame token count to their sum, and our experiments suggest that the residual tokens sufficiently capture viewpoint and motion changes relative to the keyframe. Extensive evaluations demonstrate the efficacy and efficiency of VTok: it achieves notably higher performance on a range of video understanding and text-to-video generation benchmarks than baselines using naive tokenization, all with shorter token sequences per video (e.g., 3.4% higher accuracy on our TV-Align benchmark and a 1.9% higher VBench score). Notably, VTok produces more coherent motion and stronger guidance following in text-to-video generation, owing to its more consistent temporal encoding. We hope VTok can serve as a standardized video tokenization paradigm for future research in video understanding and generation.
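The product-to-sum reduction claimed above can be illustrated with a short arithmetic sketch. This is a hedged illustration of the abstract's counting argument only, not the paper's implementation; the function names and the example frame/token counts are assumptions chosen for clarity.

```python
def naive_token_count(num_frames: int, tokens_per_frame: int) -> int:
    # Naive frame sampling: every sampled frame is fully tokenized,
    # so sequence length scales with the product.
    return num_frames * tokens_per_frame


def vtok_token_count(num_frames: int, tokens_per_frame: int) -> int:
    # VTok-style decoupling (as described in the abstract): full spatial
    # tokens for one keyframe, plus one residual token for each of the
    # remaining frames, so sequence length scales with the sum.
    return tokens_per_frame + (num_frames - 1)


# Hypothetical example: a 16-frame clip at 256 spatial tokens per frame.
print(naive_token_count(16, 256))  # 4096 tokens
print(vtok_token_count(16, 256))   # 271 tokens
```

Under these illustrative numbers, the keyframe-plus-residual scheme shrinks the sequence by roughly 15x, which is the source of the efficiency gains the abstract reports.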