🤖 AI Summary
This work addresses the entanglement of motion and content in video representation learning. To disentangle them in a self-supervised manner, we propose a novel framework that introduces a low-bitrate vector quantization module as an information bottleneck, yielding a discrete and semantically meaningful action space. A Transformer architecture jointly and implicitly models frame-level motion dynamics and clip-level content representations. Training proceeds in a self-supervised manner via a conditional denoising diffusion model. Crucially, our approach requires no strong prior assumptions yet effectively separates dynamic motion from static content. We validate the method on real-world talking-head videos and 2D cartoon animations, demonstrating its efficacy in motion transfer and autoregressive action generation. Experimental results show substantial improvements in representation generalization and cross-domain adaptability.
📝 Abstract
We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization module as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking-head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we show that our method generalizes to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.
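To make the bottleneck idea concrete, here is a minimal toy sketch of low-bitrate vector quantization: each frame-wise motion feature is snapped to its nearest entry in a small discrete codebook, so the motion channel can carry only a few bits per frame. All names, sizes, and the NumPy-only forward pass are illustrative assumptions, not the authors' implementation (which would additionally need a straight-through gradient estimator and commitment loss to train end-to-end).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a codebook of 8 entries gives log2(8) = 3 bits
# per motion token, i.e. a deliberately low-bitrate motion channel.
codebook_size = 8
feature_dim = 4
codebook = rng.normal(size=(codebook_size, feature_dim))

def quantize(motion_features):
    """Snap each continuous motion feature to its nearest codebook entry.

    motion_features: (num_frames, feature_dim) array.
    Returns (quantized features, discrete motion codes).
    """
    # Pairwise squared distances between every frame feature and every
    # codebook entry, via broadcasting: (num_frames, codebook_size).
    dists = ((motion_features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = dists.argmin(axis=1)      # discrete motion tokens, one per frame
    return codebook[codes], codes

frames = rng.normal(size=(5, feature_dim))   # 5 frames of motion features
quantized, codes = quantize(frames)
print(codes.shape, quantized.shape)          # (5,) (5, 4)
```

The quantized features (rather than the raw continuous ones) would then condition the diffusion decoder, forcing fine-grained appearance information to flow through the separate content pathway.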