🤖 AI Summary
To address the low representation efficiency and poor generation controllability caused by the entanglement of structural and dynamic information in video autoencoding, this paper proposes a structure–dynamics dual latent space disentanglement paradigm: the video latent representation is orthogonally decomposed into a "structure subspace" encoding global content and overall motion, and a "dynamics subspace" capturing fine-grained textures and rapid motion. Methodologically, the structure branch uses a Q-Former to model low-frequency temporal trends, followed by downsampling blocks that suppress redundant content detail, while the dynamics branch applies spatial average pooling to aggregate high-frequency motion features. On the MCL-JCV dataset, the method reaches a PSNR of 28.14 at a compression rate of only 0.20%, and it performs robustly and efficiently on downstream generative tasks, enabling interpretable editing and control via linear interpolation. To the best of the authors' knowledge, this is the first work to achieve explicit, orthogonal, and lightweight disentanglement of video latent spaces, simultaneously optimizing reconstruction fidelity, compression efficiency, and generation controllability.
📝 Abstract
Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these two latent spaces. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks that remove redundant content details. The second submodule averages the latent vectors along the spatial dimensions to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Check our project page for more details: https://vidtwin.github.io/.
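The two-branch extraction described above can be sketched in miniature. This is not the authors' implementation: it is a hedged, dependency-free illustration of the two pooling ideas only — spatially averaging the latent to get one per-frame Dynamics vector, and temporally downsampling as a stand-in for the Structure branch's redundancy removal (the actual model uses a Q-Former and learned downsampling blocks). All helper names and shapes are assumptions.

```python
# Illustrative sketch of VidTwin-style latent pooling (NOT the paper's code).
# Latents are plain nested lists shaped (T, H, W, C) to stay dependency-free.

def spatial_average(latent):
    """Dynamics branch idea: average over the spatial dims H and W,
    yielding one C-dimensional vector per frame. Spatial content is
    pooled away; what remains varies only across time."""
    frames = []
    for frame in latent:                       # frame: H x W x C
        h, w, c = len(frame), len(frame[0]), len(frame[0][0])
        pooled = [0.0] * c
        for row in frame:
            for pixel in row:
                for k in range(c):
                    pooled[k] += pixel[k]
        frames.append([v / (h * w) for v in pooled])
    return frames                              # shape (T, C)

def temporal_downsample(latent, stride=2):
    """Crude stand-in for the Structure branch's downsampling blocks:
    keep every `stride`-th frame to drop temporally redundant detail."""
    return latent[::stride]

# Tiny example: T=4 frames, 2x2 spatial grid, 3 channels; every value
# in frame t equals t, so the per-frame average should also equal t.
latent = [[[[float(t)] * 3 for _ in range(2)] for _ in range(2)]
          for t in range(4)]
dynamics = spatial_average(latent)        # 4 per-frame 3-vectors
structure_in = temporal_downsample(latent)  # 2 frames kept (t=0 and t=2)
```

In a real model both branches would be learned modules operating on encoder features, but the sketch captures the division of labor: the Dynamics path discards space and keeps a dense temporal signal, while the Structure path keeps spatial content at reduced temporal (and spatial) resolution.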