REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder

📅 2025-03-11
🤖 AI Summary
To address the low compression ratios in video generation caused by over-reliance on pixel-level reconstruction in embedding learning, this paper proposes a visually plausible reconstruction paradigm: an encoder–generator joint framework that prioritizes semantic fidelity over exact pixel-wise reproduction. The authors introduce diffusion Transformers (DiTs) into latent-space decoding, design a lightweight latent-conditioning module, and enable end-to-end co-optimization of compression and generation. The method achieves up to 32× temporal compression, 8× higher than the state of the art, while preserving downstream text-to-video generation quality, and significantly reduces GPU memory consumption and training/inference overhead. Extensive experiments demonstrate that the paradigm consistently balances high compression ratios with high visual fidelity across multiple benchmarks, offering a novel pathway toward efficient video generation modeling.
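The compression claims above reduce to simple frame arithmetic: 32× temporal compression maps every 32 input frames to one latent frame, and "8× higher than the state of the art" implies a baseline of roughly 4× temporal compression. The sketch below illustrates this with an assumed clip length; the specific numbers are illustrative, not taken from the paper.

```python
# Illustrative arithmetic only: clip length and baseline ratio are assumptions.
frames = 128          # example input clip length (assumed)
regen_tc = 32         # temporal compression ratio claimed by REGEN
baseline_tc = regen_tc // 8   # implied state-of-the-art ratio: 32 / 8 = 4

regen_latent_frames = frames // regen_tc        # frames the DiT decoder must re-synthesize from
baseline_latent_frames = frames // baseline_tc  # latent length under the baseline embedder

print(regen_latent_frames, baseline_latent_frames)  # → 4 32
```

An 8× shorter latent sequence is what drives the reported efficiency gains: the downstream latent diffusion model attends over far fewer tokens during both training and inference.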

📝 Abstract
We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion enables substantial improvements in compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space. Within this framework, we develop a dedicated latent conditioning module to condition the DiT decoder on the encoded video latent embedding. Our experiments demonstrate that our approach achieves superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we build video embedders achieving a temporal compression ratio of up to 32x (8x higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, providing a significant efficiency boost in latent diffusion model training and inference.
Problem

Research questions and friction points this paper is trying to address.

Pixel-exact reconstruction objectives limit achievable video compression ratios
High GPU memory and training/inference cost of latent video generation models
Preserving text-to-video generation quality under aggressive temporal compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encoder-generator framework with diffusion transformer
Latent conditioning module for video embedding
Achieves 32x temporal video compression ratio
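The encoder-generator idea in the bullets above can be sketched as a toy pipeline: an encoder compresses the frame sequence 32× into a compact latent, and a generator conditioned on that latent synthesizes a plausible (not pixel-exact) reconstruction at full length. This is a minimal stand-in using scalar "frames" and jitter in place of the paper's actual encoder and DiT decoder; all names and numbers here are hypothetical.

```python
import random

def encode(frames, tc=32):
    """Toy encoder: average-pool frames in groups of tc, yielding a latent
    sequence tc times shorter than the input (the compact embedding)."""
    return [sum(frames[i:i + tc]) / tc for i in range(0, len(frames), tc)]

def generate(latent, tc=32, seed=0):
    """Stand-in for the DiT decoder: conditioned on the latent, synthesize
    plausible frames (repeat each latent frame and jitter it) rather than
    reproducing the input pixels exactly."""
    rng = random.Random(seed)
    return [z + 0.01 * rng.uniform(-1, 1) for z in latent for _ in range(tc)]

frames = [float(i) for i in range(64)]   # a 64-"frame" toy clip
latent = encode(frames)                  # 2 latent frames: 32x temporal compression
recon = generate(latent)                 # 64 frames again: plausible, not exact
print(len(latent), len(recon))           # → 2 64
```

The key design choice the paper argues for is visible even in this toy: the reconstruction is judged on plausibility rather than per-pixel identity, which is what lets the latent be so much smaller than in conventional encoder-decoder embedders.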