🤖 AI Summary
This work addresses the entanglement of motion and content in video representation learning. To disentangle them in a self-supervised manner, we propose a novel framework that introduces a low-bitrate vector quantization module as an information bottleneck, yielding a discrete and semantically meaningful action space. A Transformer architecture jointly and implicitly models frame-level motion dynamics and clip-level content representations. Training proceeds in a self-supervised manner via a conditional denoising diffusion model. Crucially, our approach requires no strong prior assumptions yet effectively separates dynamic motion from static content. We validate the method on real-world talking-head videos and 2D cartoon animations, demonstrating its efficacy in motion transfer and autoregressive action generation. Experimental results show substantial improvements in representation generalization and cross-domain adaptability.
📝 Abstract
We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization module as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking-head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we show that our method generalizes to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.
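To make the bottleneck idea concrete, here is a minimal toy sketch of low-bitrate vector quantization: each frame-wise motion feature is snapped to its nearest entry in a small discrete codebook, so the motion channel can carry only a few bits per frame. All names, sizes, and the NumPy-only forward pass are illustrative assumptions, not the authors' implementation (which would additionally need a straight-through gradient estimator and commitment loss to train end-to-end).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a codebook of 8 entries gives log2(8) = 3 bits
# per motion token, i.e. a deliberately low-bitrate motion channel.
codebook_size = 8
feature_dim = 4
codebook = rng.normal(size=(codebook_size, feature_dim))

def quantize(motion_features):
    """Snap each continuous motion feature to its nearest codebook entry.

    motion_features: (num_frames, feature_dim) array.
    Returns (quantized features, discrete motion codes).
    """
    # Pairwise squared distances between every frame feature and every
    # codebook entry, via broadcasting: (num_frames, codebook_size).
    dists = ((motion_features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = dists.argmin(axis=1)      # discrete motion tokens, one per frame
    return codebook[codes], codes

frames = rng.normal(size=(5, feature_dim))   # 5 frames of motion features
quantized, codes = quantize(frames)
print(codes.shape, quantized.shape)          # (5,) (5, 4)
```

The quantized features (rather than the raw continuous ones) would then condition the diffusion decoder, forcing fine-grained appearance information to flow through the separate content pathway.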