TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of jointly achieving fine-grained control over camera motion and human pose in human-centric video generation, this paper proposes TokenMotion, a DiT-based framework that models camera motion and human pose as spatio-temporally decoupled, editable tokens. It introduces a human-aware dynamic masking mechanism that enables spatio-temporally adaptive disentanglement and fusion of the two motion signals, and builds a text- and image-conditioned video diffusion model with fine-grained control over camera motion, human pose, and their joint interaction. Extensive experiments demonstrate that TokenMotion consistently outperforms state-of-the-art methods on both text-to-video and image-to-video benchmarks. Notably, it achieves substantial improvements in generation fidelity and controllability on complex human-motion scenarios (e.g., the Grammy Glambot moment) where precise coordination between camera dynamics and subject articulation is critical.

📝 Abstract
Human-centric motion control in video generation remains a critical challenge, particularly when jointly controlling camera movements and human poses in scenarios like the iconic Grammy Glambot moment. While recent video diffusion models have made significant progress, existing approaches struggle with limited motion representations and inadequate integration of camera and human motion controls. In this work, we present TokenMotion, the first DiT-based video diffusion framework that enables fine-grained control over camera motion, human motion, and their joint interaction. We represent camera trajectories and human poses as spatio-temporal tokens to enable local control granularity. Our approach introduces a unified modeling framework utilizing a decouple-and-fuse strategy, bridged by a human-aware dynamic mask that effectively handles the spatially-and-temporally varying nature of combined motion signals. Through extensive experiments, we demonstrate TokenMotion's effectiveness across both text-to-video and image-to-video paradigms, consistently outperforming current state-of-the-art methods in human-centric motion control tasks. Our work represents a significant advancement in controllable video generation, with particular relevance for creative production applications.
Problem

Research questions and friction points this paper is trying to address.

Decoupling camera and human motion control in video generation
Improving integration of fine-grained motion representations
Enhancing human-centric video generation for creative production
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiT-based video diffusion framework
Spatio-temporal tokens for motion control
Decouple-and-fuse strategy with dynamic mask
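
The decouple-and-fuse idea in the bullets above can be illustrated as two separately encoded token streams, one for camera trajectory and one for human pose, blended by a per-pixel, per-frame human mask. This is a minimal numpy sketch under stated assumptions: the function name, tensor layout, and the simple convex blend are illustrative only, since in the paper the fusion is learned inside DiT blocks rather than being a fixed linear combination.

```python
import numpy as np

def fuse_motion_tokens(camera_tokens, pose_tokens, human_mask):
    """Blend decoupled motion token grids with a human-aware mask.

    camera_tokens, pose_tokens: (T, H, W, C) spatio-temporal token grids
    human_mask: (T, H, W) values in [0, 1], ~1 on human regions per frame

    Illustrative sketch only: pose tokens dominate inside human regions,
    camera tokens dominate elsewhere.
    """
    m = human_mask[..., None]  # broadcast mask over the channel axis
    return m * pose_tokens + (1.0 - m) * camera_tokens

# Toy example: a 2-frame, 4x4 token grid with an 8-dim channel.
T, H, W, C = 2, 4, 4, 8
cam = np.zeros((T, H, W, C))   # stand-in camera-motion tokens
pose = np.ones((T, H, W, C))   # stand-in human-pose tokens
mask = np.zeros((T, H, W))
mask[:, 1:3, 1:3] = 1.0        # a moving "human" occupies the center patch

fused = fuse_motion_tokens(cam, pose, mask)
```

Because the mask varies over both space and time, the blend adapts per frame, which is the spatially-and-temporally varying behavior the abstract attributes to the human-aware dynamic mask.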
👥 Authors
Ruineng Li
OPPO US AI Center
Daitao Xing
OPPO US AI Center
Huiming Sun
OPPO US Research Center
Yuanzhou Ha
Jinglin Shen
OPPO US AI Center
Chiuman Ho
OPPO US AI Center