Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the entanglement of motion and content in video representation learning. To separate the two in a self-supervised manner, we propose a framework that introduces a low-bitrate vector quantization module as an information bottleneck, yielding a discrete and semantically meaningful action space. A Transformer architecture jointly produces implicit frame-level motion features and clip-level content representations, and the whole pipeline is trained without labels via a conditional denoising diffusion model. Crucially, our approach requires no strong prior assumptions yet achieves effective separation of dynamic motion from static content. We validate the method on real-world talking-head videos and 2D cartoon animations, demonstrating its efficacy in motion transfer and autoregressive action generation. Experimental results show substantial improvements in representation generalization and cross-domain adaptability.
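
To make the bottleneck idea concrete, here is a minimal PyTorch sketch of a VQ-VAE-style quantizer with a deliberately small codebook, assuming per-frame motion features as input; the class name, sizes, and loss weighting are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowBitrateVQ(nn.Module):
    """Quantizes per-frame motion features against a small codebook."""

    def __init__(self, codebook_size=64, dim=256, beta=0.25):
        super().__init__()
        # A small codebook caps the rate at log2(codebook_size) bits per token,
        # pressuring the bottleneck to keep only motion information.
        self.codebook = nn.Embedding(codebook_size, dim)
        self.beta = beta  # commitment-loss weight, as in VQ-VAE

    def forward(self, z):
        # z: (batch, tokens, dim) continuous motion features.
        flat = z.reshape(-1, z.size(-1))                   # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)    # (B*T, K)
        idx = dists.argmin(dim=-1).view(z.shape[:-1])      # (B, T)
        z_q = self.codebook(idx)                           # (B, T, dim)
        # VQ-VAE losses: pull codes toward encodings, commit encodings to codes.
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator so gradients flow through the argmin.
        z_q = z + (z_q - z).detach()
        return z_q, idx, vq_loss
```

With 64 entries, each motion token carries at most 6 bits, which is the "bitrate control" that discourages static content from leaking into the motion stream.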

📝 Abstract
We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we show that our method can generalize to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.
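
As an illustration of the joint motion/content encoding the abstract describes, below is a minimal sketch assuming per-frame features from a visual backbone enter a shared transformer together with a learned clip-level token; the content-token design, layer sizes, and names are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn

class MotionContentEncoder(nn.Module):
    """Jointly encodes frame-wise motion and a single clip-wise content vector."""

    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        # Learned clip-level query, prepended to the frame sequence.
        self.content_token = nn.Parameter(torch.randn(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frame_feats):
        # frame_feats: (batch, frames, dim) features from a per-frame backbone.
        b = frame_feats.size(0)
        tokens = torch.cat([self.content_token.expand(b, -1, -1), frame_feats], dim=1)
        out = self.encoder(tokens)
        content = out[:, 0]   # one static content vector per clip
        motion = out[:, 1:]   # one dynamic motion vector per frame
        return motion, content
```

The motion stream would then pass through the low-bitrate quantizer, while the content vector bypasses it, so only heavily compressed information can describe frame-to-frame change.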
Problem

Research questions and friction points this paper is trying to address.

Disentangling video into motion and content components
Developing self-supervised representation learning framework
Creating meaningful discrete motion space with low-bitrate bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based architecture for motion and content features
Low-bitrate vector quantization as information bottleneck
Denoising diffusion model for self-supervised representation learning (see the training sketch below)
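
To show how the disentangled latents can condition the diffusion objective, here is a hedged DDPM-style training-loss sketch; `denoiser`, its signature, and the noise schedule are hypothetical placeholders, not the authors' model.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(denoiser, x0, motion, content, alphas_cumprod):
    """One epsilon-prediction training step conditioned on motion/content latents.

    x0: clean frames (B, C, H, W); motion/content: quantized conditioning latents;
    alphas_cumprod: precomputed cumulative noise schedule of shape (num_steps,).
    """
    b = x0.size(0)
    # Sample a random timestep per example and Gaussian noise.
    t = torch.randint(0, alphas_cumprod.size(0), (b,), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    # Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise.
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # The network predicts the added noise, given the disentangled latents.
    pred = denoiser(x_t, t, motion=motion, content=content)
    return F.mse_loss(pred, noise)
```

Because reconstruction is only possible when motion and content together describe the frame, minimizing this loss drives both representations to be informative without any labels.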
Authors
Xiao Li
Microsoft Research Asia
Qi Chen
Microsoft Research Asia, Shanghai Jiao Tong University, Shanghai Innovation Institute
Xiulian Peng
Researcher at Microsoft Research Asia
deep learning, audio and speech, computer vision, real-time communication, image/video coding
Kai Yu
Shanghai Jiao Tong University, Shanghai Innovation Institute
Xie Chen
Shanghai Jiao Tong University, Shanghai Innovation Institute
Yan Lu
Microsoft Research Asia