AI Summary
Existing video tokenization methods typically process sets of frames independently, failing to capture the temporal dependencies and redundancy inherent in video. This work proposes RefTok, the first video tokenization method to introduce a reference-frame mechanism: it jointly encodes and decodes frame sequences conditioned on an unquantized reference frame, explicitly modeling temporal dynamics and contextual information while achieving both high compression efficiency and high-fidelity reconstruction. The key innovation is reference-frame guidance, which preserves fine-grained details (e.g., facial features, text, small-scale textures) across frames and enables context-aware reconstruction. On four standard benchmarks, RefTok consistently outperforms Cosmos and MAGVIT, improving PSNR, SSIM, and LPIPS by an average of 36.7% at the same or higher compression ratios. On the BAIR action-prediction task, a generative model built on RefTok's latents surpasses the significantly larger MAGVIT-L by 27.9%, demonstrating superior spatiotemporal modeling capability.
Abstract
Effectively handling temporal redundancy remains a key challenge in learning video models. Prevailing approaches often treat each set of frames independently, failing to effectively capture the temporal dependencies and redundancies inherent in videos. To address this limitation, we introduce RefTok, a novel reference-based tokenization method capable of capturing complex temporal dynamics and contextual information. Our method encodes and decodes sets of frames conditioned on an unquantized reference frame. When decoded, RefTok preserves the continuity of motion and the appearance of objects across frames. For example, RefTok retains facial details despite head motion, reconstructs text correctly, preserves small patterns, and maintains the legibility of handwriting from the context. Across four video datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS), RefTok significantly outperforms current state-of-the-art tokenizers (Cosmos and MAGVIT) and improves all evaluated metrics (PSNR, SSIM, LPIPS) by an average of 36.7% at the same or higher compression ratios. When a video generation model is trained on RefTok's latents for the BAIR Robot Pushing task, the generations outperform not only MAGVIT-B but also the larger MAGVIT-L, which has 4x more parameters, across all generation metrics by an average of 27.9%.
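To build intuition for why conditioning on an unquantized reference frame exploits temporal redundancy, here is a minimal toy sketch. It is not the paper's architecture (RefTok uses learned neural encoders/decoders and a learned codebook); it only illustrates the underlying idea with a hand-rolled uniform quantizer standing in for tokenization: frames that differ little from the reference are encoded as residuals, the residuals quantize more accurately than whole frames at the same token budget, and decoding adds the unquantized reference back so its fine detail survives.

```python
import numpy as np

def quantize(x, levels=16):
    """Uniform quantizer over the array's own range (a stand-in for a
    learned codebook). With a fixed number of levels, a smaller dynamic
    range yields finer steps, so low-energy residuals quantize well."""
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        return x.copy()
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo

def encode(frames, reference):
    """Tokenize each frame as a quantized residual w.r.t. the reference.
    The reference itself is never quantized."""
    return [quantize(f - reference) for f in frames]

def decode(tokens, reference):
    """Reconstruct frames by adding the unquantized reference back."""
    return [reference + t for t in tokens]

rng = np.random.default_rng(0)
reference = rng.uniform(-1.0, 1.0, size=(8, 8))  # detailed reference frame
# Subsequent frames differ only slightly from the reference (temporal redundancy).
frames = [reference + 0.05 * rng.standard_normal((8, 8)) for _ in range(3)]

recon = decode(encode(frames, reference), reference)
ref_err = np.mean([np.abs(r - f).mean() for r, f in zip(recon, frames)])

# Baseline: quantize every frame independently, ignoring temporal redundancy.
indep_err = np.mean([np.abs(quantize(f) - f).mean() for f in frames])
print(ref_err < indep_err)  # reference-conditioned coding is more accurate
```

The design point this illustrates is the one the abstract makes: most of each frame's content is already present in the reference, so spending the token budget on the (small) change rather than the whole frame preserves detail that independent per-frame tokenization would destroy.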