🤖 AI Summary
Existing video autoencoders are limited by fixed compression ratios, the rigidity of CNN architectures, and detail loss due to deterministic decoding. This work proposes One-DVA, a novel framework that, for the first time, enables variable-length latent representations and adaptive compression ratios in video autoencoding. It employs a query-based Vision Transformer for one-dimensional latent encoding, coupled with a pixel-level diffusion Transformer for conditional reconstruction. The approach further incorporates latent space regularization and a two-stage training strategy. One-DVA achieves reconstruction quality on par with 3D-CNN VAEs at equivalent compression ratios, while supporting significantly higher compression rates with markedly reduced generation artifacts, thereby enhancing performance in downstream generative tasks.
📝 Abstract
Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression that wastes tokens on simple videos, (2) inflexible CNN architectures that prevent variable-length latent modeling, and (3) deterministic decoders that struggle to recover appropriate details from compressed latents. To address these issues, we propose One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder employs query-based vision transformers to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent length. The decoder is a pixel-space diffusion transformer that reconstructs videos conditioned on these latents. With a two-stage training strategy, One-DVA achieves performance comparable to 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, it supports adaptive compression and can thus achieve higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts caused by the generation process.
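To make the two core ideas concrete, below is a minimal NumPy sketch of (a) query-based 1D encoding, where a fixed set of learnable queries cross-attends over flattened spatiotemporal patch tokens to produce a 1D latent sequence, and (b) a variable-length dropout that keeps only a random-length prefix of that sequence. All shapes, names, and the prefix-truncation form of the dropout are illustrative assumptions, not the paper's actual implementation (which uses full vision-transformer blocks and a learned decoder):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_encode(patch_tokens, queries):
    """One cross-attention read: each learnable query attends over the
    video's patch tokens and emits one 1D latent token (simplified stand-in
    for the paper's query-based ViT encoder)."""
    d = queries.shape[-1]
    attn = softmax(queries @ patch_tokens.T / np.sqrt(d))  # (Q, N)
    return attn @ patch_tokens                             # (Q, d) latents

def variable_length_dropout(latents, rng, min_keep=4):
    """Keep a random-length prefix of the latent sequence during training,
    so the decoder learns to reconstruct from latents of any length
    (hypothetical form of the variable-length dropout mechanism)."""
    n = int(rng.integers(min_keep, latents.shape[0] + 1))
    return latents[:n]

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((256, 64))  # flattened spatiotemporal patches
queries = rng.standard_normal((32, 64))        # learnable latent queries
latents = query_encode(patch_tokens, queries)
short = variable_length_dropout(latents, rng)
print(latents.shape, short.shape)
```

At inference time, the same mechanism would let the encoder emit fewer latent tokens for simpler videos, which is what enables the adaptive compression ratios described above.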