DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation

📅 2025-11-18

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Existing video VAEs neglect inter-frame content similarity, which leads to redundant latent representations. To address this, we propose the decoupled VAE (DeCo-VAE), which explicitly decomposes a video into three complementary components: keyframes (static content), motion (temporal transformations), and residuals (high-frequency details). A dedicated encoder for each component learns a compact latent code without cross-component interference, while a shared 3D decoder maintains spatiotemporal consistency during reconstruction. Training follows a staged freezing strategy: the keyframe encoder is optimized first while the motion and residual encoders stay frozen, and the encoders are then unfrozen and fine-tuned in turn, which mitigates feature interference and stabilizes learning of both static and dynamic features. Quantitative and qualitative experiments show that DeCo-VAE achieves superior reconstruction quality with a more compact latent space.


📝 Abstract

Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose the decoupled VAE (DeCo-VAE) to achieve a compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into three distinct components via explicit decoupling (keyframe, motion, and residual) and learn a dedicated latent representation for each. To avoid cross-component interference, we design a dedicated encoder for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further use a decoupled adaptation strategy that freezes some encoders while training the others in sequence, ensuring stable training and accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that DeCo-VAE achieves superior video reconstruction performance.
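To make the keyframe / motion / residual split concrete, here is a minimal toy sketch in plain Python. It is not the paper's method: the abstract does not say how each component is computed, so everything below is an assumption chosen for illustration. Frames are 1-D lists, the keyframe is simply the first frame, "motion" is a single circular shift per frame, and the residual is whatever the shifted keyframe fails to explain. The helper names (`shift`, `estimate_shift`, `decompose`, `reconstruct`) are hypothetical.

```python
from typing import List, Tuple

Frame = List[float]


def shift(frame: Frame, dx: int) -> Frame:
    """Circularly shift a 1-D frame to the right by dx samples."""
    n = len(frame)
    return [frame[(i - dx) % n] for i in range(n)]


def estimate_shift(keyframe: Frame, frame: Frame, max_dx: int = 3) -> int:
    """Return the shift of the keyframe that best matches `frame` (least squared error)."""
    def err(dx: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(shift(keyframe, dx), frame))
    return min(range(-max_dx, max_dx + 1), key=err)


def decompose(video: List[Frame]) -> Tuple[Frame, List[int], List[Frame]]:
    """Split a video into keyframe, per-frame motion, and per-frame residual.

    Assumption: the first frame serves as the keyframe for the whole clip.
    """
    keyframe = video[0]
    motion: List[int] = []
    residuals: List[Frame] = []
    for frame in video[1:]:
        dx = estimate_shift(keyframe, frame)
        predicted = shift(keyframe, dx)
        motion.append(dx)
        residuals.append([f - p for f, p in zip(frame, predicted)])
    return keyframe, motion, residuals


def reconstruct(keyframe: Frame, motion: List[int], residuals: List[Frame]) -> List[Frame]:
    """Invert `decompose`: warp the keyframe by each shift, then add the residual."""
    frames = [list(keyframe)]
    for dx, res in zip(motion, residuals):
        predicted = shift(keyframe, dx)
        frames.append([p + r for p, r in zip(predicted, res)])
    return frames


video = [
    [0, 1, 2, 0, 0, 0],
    [0, 0, 1, 2, 0, 0],    # keyframe moved right by 1
    [0, 0, 0, 1, 2.5, 0],  # moved right by 2, plus a small detail change
]
keyframe, motion, residuals = decompose(video)
restored = reconstruct(keyframe, motion, residuals)
```

In this toy clip the residuals are zero almost everywhere, which is the intuition behind the redundancy argument: once the static content and its motion are accounted for, little is left to encode per frame.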

Problem

Research questions and friction points this paper is trying to address.

Reduces redundant latent modeling in video VAEs

Decouples video content into keyframe, motion and residual

Prevents cross-component interference with dedicated encoders

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples video into keyframe, motion, residual components

Uses dedicated encoders per component with shared 3D decoder

Employs sequential training with partial encoder freezing
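The sequential training with partial encoder freezing listed above can be sketched as a simple schedule. This is a hedged illustration only: the stage order, the one-encoder-per-stage layout, and the names `ENCODERS`, `SCHEDULE`, `frozen_in`, `sgd_step`, and `run_schedule` are assumptions, not details taken from the paper. Each encoder is reduced to a single scalar weight so the effect of freezing is easy to see, and the shared decoder is left out because its treatment during these stages is not described here.

```python
from typing import Dict, List, Tuple

ENCODERS: Tuple[str, ...] = ("keyframe", "motion", "residual")

# Hypothetical schedule: each stage lists the encoders that stay trainable;
# every other encoder is frozen for that stage.
SCHEDULE: List[Tuple[str, ...]] = [("keyframe",), ("motion",), ("residual",)]


def frozen_in(stage: Tuple[str, ...]) -> List[str]:
    """Names of the encoders that receive no updates in this stage."""
    return [name for name in ENCODERS if name not in stage]


def sgd_step(
    params: Dict[str, float],
    grads: Dict[str, float],
    stage: Tuple[str, ...],
    lr: float = 0.1,
) -> Dict[str, float]:
    """One gradient step that updates only the encoders trainable in `stage`."""
    return {
        name: (w - lr * grads[name] if name in stage else w)
        for name, w in params.items()
    }


def run_schedule(
    params: Dict[str, float],
    grads: Dict[str, float],
    schedule: List[Tuple[str, ...]] = SCHEDULE,
    lr: float = 0.1,
) -> List[Dict[str, float]]:
    """Apply one step per stage and record the weights after each stage."""
    history = []
    for stage in schedule:
        params = sgd_step(params, grads, stage, lr)
        history.append(dict(params))
    return history


start = {"keyframe": 1.0, "motion": 1.0, "residual": 1.0}
grads = {"keyframe": 1.0, "motion": 1.0, "residual": 1.0}
history = run_schedule(start, grads, lr=0.5)
```

In a real PyTorch training loop the same idea would usually be expressed by calling `requires_grad_(False)` on the frozen encoder modules before each stage, rather than by filtering updates by name.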

🔎 Similar Papers

VideoPrism: A Foundational Visual Encoder for Video Understanding

2024-02-20 · International Conference on Machine Learning · Citations: 30

When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

2024-08-15 · arXiv.org · Citations: 0
