AI Summary
Existing video tokenization methods typically process sets of frames independently, failing to capture the temporal dependencies and redundancy inherent in video. This work proposes RefTok, the first video tokenization method to introduce a reference-frame mechanism: it jointly encodes and decodes frame sequences conditioned on an unquantized reference frame, explicitly modeling temporal dynamics and contextual information while achieving both high compression efficiency and high-fidelity reconstruction. The key innovation is reference-frame guidance, which preserves fine-grained details (e.g., facial features, text, small-scale textures) across frames and enables context-aware reconstruction. On four standard benchmarks, RefTok consistently outperforms Cosmos and MAGVIT, improving PSNR, SSIM, and LPIPS by an average of 36.7% at the same or higher compression ratios. On the BAIR action-prediction task, a generative model built on RefTok's latents surpasses the significantly larger MAGVIT-L by 27.9%, demonstrating superior spatiotemporal modeling capability.
Abstract
Effectively handling temporal redundancy remains a key challenge in learning video models. Prevailing approaches often treat each set of frames independently, failing to effectively capture the temporal dependencies and redundancies inherent in videos. To address this limitation, we introduce RefTok, a novel reference-based tokenization method capable of capturing complex temporal dynamics and contextual information. Our method encodes and decodes sets of frames conditioned on an unquantized reference frame. When decoded, RefTok preserves the continuity of motion and the appearance of objects across frames. For example, RefTok retains facial details despite head motion, reconstructs text correctly, preserves small patterns, and maintains the legibility of handwriting from the context. Across four video datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS), RefTok significantly outperforms current state-of-the-art tokenizers (Cosmos and MAGVIT) and improves all evaluated metrics (PSNR, SSIM, LPIPS) by an average of 36.7% at the same or higher compression ratios. When a video generation model is trained on RefTok's latents for the BAIR Robot Pushing task, the generations outperform not only MAGVIT-B but also the larger MAGVIT-L, which has 4x more parameters, across all generation metrics by an average of 27.9%.
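To build intuition for why conditioning on an unquantized reference frame exploits temporal redundancy, here is a minimal toy sketch. It is not the paper's architecture (RefTok uses learned neural encoders/decoders and a learned codebook); it only illustrates the underlying idea with a hand-rolled uniform quantizer standing in for tokenization: frames that differ little from the reference are encoded as residuals, the residuals quantize more accurately than whole frames at the same token budget, and decoding adds the unquantized reference back so its fine detail survives.

```python
import numpy as np

def quantize(x, levels=16):
    """Uniform quantizer over the array's own range (a stand-in for a
    learned codebook). With a fixed number of levels, a smaller dynamic
    range yields finer steps, so low-energy residuals quantize well."""
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        return x.copy()
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo

def encode(frames, reference):
    """Tokenize each frame as a quantized residual w.r.t. the reference.
    The reference itself is never quantized."""
    return [quantize(f - reference) for f in frames]

def decode(tokens, reference):
    """Reconstruct frames by adding the unquantized reference back."""
    return [reference + t for t in tokens]

rng = np.random.default_rng(0)
reference = rng.uniform(-1.0, 1.0, size=(8, 8))  # detailed reference frame
# Subsequent frames differ only slightly from the reference (temporal redundancy).
frames = [reference + 0.05 * rng.standard_normal((8, 8)) for _ in range(3)]

recon = decode(encode(frames, reference), reference)
ref_err = np.mean([np.abs(r - f).mean() for r, f in zip(recon, frames)])

# Baseline: quantize every frame independently, ignoring temporal redundancy.
indep_err = np.mean([np.abs(quantize(f) - f).mean() for f in frames])
print(ref_err < indep_err)  # reference-conditioned coding is more accurate
```

The design point this illustrates is the one the abstract makes: most of each frame's content is already present in the reference, so spending the token budget on the (small) change rather than the whole frame preserves detail that independent per-frame tokenization would destroy.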