DisMo: Disentangled Motion Representations for Open-World Motion Transfer

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-video (T2V) and image-to-video (I2V) models lack explicit motion representations that are decoupled from content, which severely limits motion transfer and editing. To address this, we propose DisMo, a self-supervised, alignment-free method that learns abstract motion representations through an image-space reconstruction objective, disentangling motion from static factors such as appearance, identity, and pose. The representation enables open-world motion transfer across semantically unrelated categories without requiring pixel- or instance-level correspondences, and lightweight adapters let it plug into arbitrary video generators while preserving both motion fidelity and text-video alignment. In zero-shot action classification, the learned representations outperform state-of-the-art video representation models such as V-JEPA on Something-Something v2 and Jester.
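
For intuition, below is a minimal, hypothetical PyTorch sketch of such an image-space reconstruction objective: a clip encoder produces a compact motion code, and a decoder must reproduce a target frame from a different source frame plus that code, so appearance must come from the source frame and only the dynamics from the code. All module names, shapes, and the bottleneck-based disentanglement are illustrative assumptions, not DisMo's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionEncoder(nn.Module):
    """Hypothetical clip encoder: compresses a short video into a small motion code.
    The narrow bottleneck is one common way to discourage the code from carrying
    appearance information (an illustrative assumption, not DisMo's design)."""

    def __init__(self, in_ch=3, dim=64, motion_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, dim, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.GELU(),
            nn.Conv3d(dim, dim, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.GELU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.proj = nn.Linear(dim, motion_dim)

    def forward(self, clip):  # clip: (B, C, T, H, W)
        return self.proj(self.net(clip).flatten(1))  # (B, motion_dim)


class FrameDecoder(nn.Module):
    """Hypothetical decoder: reconstructs the target frame from a source frame
    (appearance) and the motion code (dynamics)."""

    def __init__(self, in_ch=3, dim=64, motion_dim=16):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, dim, 4, 2, 1), nn.GELU(),
            nn.Conv2d(dim, dim, 4, 2, 1), nn.GELU(),
        )
        self.fuse = nn.Conv2d(dim + motion_dim, dim, kernel_size=1)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, 2, 1), nn.GELU(),
            nn.ConvTranspose2d(dim, in_ch, 4, 2, 1),
        )

    def forward(self, source_frame, motion_code):
        h = self.enc(source_frame)                                    # (B, dim, H/4, W/4)
        m = motion_code[:, :, None, None].expand(-1, -1, h.shape[2], h.shape[3])
        return self.dec(self.fuse(torch.cat([h, m], dim=1)))          # (B, C, H, W)


def reconstruction_step(encoder, decoder, clip, source_frame, target_frame):
    """Self-supervised image-space objective: predict the target frame from a
    different source frame plus the motion code extracted from the clip."""
    motion_code = encoder(clip)
    prediction = decoder(source_frame, motion_code)
    return F.mse_loss(prediction, target_frame)
```
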

📝 Abstract
Recent advances in text-to-video (T2V) and image-to-video (I2V) models have enabled the creation of visually compelling and dynamic videos from simple textual descriptions or initial frames. However, these models often fail to provide an explicit representation of motion separate from content, limiting their applicability for content creators. To address this gap, we propose DisMo, a novel paradigm for learning abstract motion representations directly from raw video data via an image-space reconstruction objective. Our representation is generic and independent of static information such as appearance, object identity, or pose. This enables open-world motion transfer, allowing motion to be transferred across semantically unrelated entities without requiring object correspondences, even between vastly different categories. Unlike prior methods, which trade off motion fidelity against prompt adherence by overfitting to the source structure or drifting from the described action, our approach disentangles motion semantics from appearance, enabling accurate transfer and faithful conditioning. Furthermore, our motion representation can be combined with any existing video generator via lightweight adapters, allowing us to effortlessly benefit from future advancements in video models. We demonstrate the effectiveness of our method through a diverse set of motion transfer tasks. Finally, we show that the learned representations are well-suited for downstream motion understanding tasks, consistently outperforming state-of-the-art video representation models such as V-JEPA in zero-shot action classification on benchmarks including Something-Something v2 and Jester. Project page: https://compvis.github.io/DisMo
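
As an illustration of how frozen motion embeddings could be used for action classification, the sketch below matches query clips to per-class prototypes by cosine similarity. This nearest-prototype protocol and the `motion_encoder` interface are assumptions for illustration only; the paper's exact zero-shot evaluation setup may differ.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def prototype_classify(motion_encoder, query_clips, support_clips, support_labels):
    """One possible protocol (an assumption, not necessarily DisMo's evaluation):
    build one prototype per action class from labelled support clips, then assign
    each query clip to the class whose prototype is closest in embedding space."""
    support = F.normalize(motion_encoder(support_clips), dim=-1)       # (N, D)
    classes = sorted(set(support_labels.tolist()))
    prototypes = F.normalize(
        torch.stack([support[support_labels == c].mean(dim=0) for c in classes]),
        dim=-1,
    )                                                                  # (K, D)

    queries = F.normalize(motion_encoder(query_clips), dim=-1)         # (M, D)
    similarities = queries @ prototypes.T                              # (M, K)
    return torch.tensor([classes[int(i)] for i in similarities.argmax(dim=-1)])
```

Because the encoder stays frozen and no classifier is trained, accuracy in such a setup depends entirely on how well the embedding space separates actions, which is what makes it a useful probe of the motion representation itself.
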
Problem

Research questions and friction points this paper is trying to address.

Separating motion representation from content in video generation models
Enabling motion transfer across semantically unrelated entities without correspondences
Overcoming trade-offs between motion fidelity and prompt adherence in video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns disentangled motion representations from raw video
Enables motion transfer across semantically unrelated entities
Combines with existing video generators via lightweight adapters (see the adapter sketch after this list)
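
A minimal sketch of what such a lightweight adapter could look like, assuming cross-attention injection of motion tokens into a frozen generator block with a zero-initialized output projection; the names and the zero-init choice are illustrative assumptions, not the paper's confirmed design.

```python
import torch
import torch.nn as nn


class MotionAdapter(nn.Module):
    """Hypothetical lightweight adapter: injects motion tokens into one block of a
    frozen video generator through cross-attention. The zero-initialized output
    projection makes the adapter a no-op at the start of training, so the
    generator's original behaviour is preserved until the adapter learns
    something useful (an illustrative choice, not the paper's confirmed design)."""

    def __init__(self, hidden_dim, motion_dim, num_heads=8):
        super().__init__()
        self.to_ctx = nn.Linear(motion_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.out = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, hidden_states, motion_tokens):
        # hidden_states: (B, L, hidden_dim) activations inside the frozen generator
        # motion_tokens: (B, M, motion_dim) from the frozen motion encoder
        ctx = self.to_ctx(motion_tokens)
        attended, _ = self.attn(hidden_states, ctx, ctx)
        return hidden_states + self.out(attended)   # residual injection
```

In such a setup only the adapter parameters would be trained; the generator and the motion encoder stay frozen, which is what makes the conditioning plug-and-play across different video backbones.
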