RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods struggle to independently control foreground subjects, background scenes, 3D human trajectories, and motion patterns in human video generation. This paper introduces a decoupled 3D human motion control framework that fully separates and freely composes these four factors. To ensure realism, it edits motion directly in a ground-aware 3D world space, unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed and orientation alignment. Within a text-to-video diffusion Transformer, it incorporates subject-token full-attention injection, background-channel concatenation, and additive motion-signal fusion. Evaluated on multiple benchmarks and real-world scenarios, the method achieves state-of-the-art performance, significantly improving fine-grained controllability and visual fidelity, and enables high-fidelity synthesis of arbitrary humans performing arbitrary actions in arbitrary scenes.
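The trajectory back-projection step can be pictured with a small pinhole-camera sketch. The snippet below is illustrative only (not the authors' code): it unprojects an edited 2D trajectory into 3D world space by intersecting calibrated camera rays with the ground plane. The function name, the y-up world frame, and the fixed camera pose are assumptions.

```python
# Minimal sketch: unproject an edited 2D ground-plane trajectory into 3D
# by intersecting camera rays with the ground plane (pinhole model).
# Assumptions: calibrated focal length f, principal point (cx, cy),
# camera-to-world rotation R and camera position t, ground plane y = 0.
import numpy as np

def unproject_to_ground(traj_2d, f, cx, cy, R, t):
    """traj_2d: (N, 2) pixel coordinates; returns (N, 3) points on the ground plane."""
    points_3d = []
    for u, v in traj_2d:
        # Ray direction in camera coordinates.
        ray_cam = np.array([(u - cx) / f, (v - cy) / f, 1.0])
        # Rotate the ray into the world frame.
        ray_world = R @ ray_cam
        # Intersect with the ground plane y = 0: t_y + s * ray_y = 0.
        s = -t[1] / ray_world[1]
        points_3d.append(t + s * ray_world)
    return np.stack(points_3d)

# Toy example: camera 1.6 m above the ground, 180-degree rotation about the
# camera x axis so that world y points up; a short edited path in image space.
R = np.diag([1.0, -1.0, -1.0])
t = np.array([0.0, 1.6, 0.0])
traj_2d = np.stack([np.linspace(300, 340, 8), np.full(8, 420)], axis=1)
traj_3d = unproject_to_ground(traj_2d, f=1000.0, cx=320.0, cy=240.0, R=R, t=t)
```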

📝 Abstract
Generating human videos with realistic and controllable motions is a challenging task. While existing methods can generate visually compelling videos, they lack separate control over four key video elements: foreground subject, background video, human trajectory and action patterns. In this paper, we propose a decomposed human motion control and video generation framework that explicitly decouples motion from appearance, subject from background, and action from trajectory, enabling flexible mix-and-match composition of these elements. Concretely, we first build a ground-aware 3D world coordinate system and perform motion editing directly in the 3D space. Trajectory control is implemented by unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed alignment and orientation adjustment; actions are supplied by a motion bank or generated via text-to-motion methods. Then, based on modern text-to-video diffusion transformer models, we inject the subject as tokens for full attention, concatenate the background along the channel dimension, and add motion (trajectory and action) control signals by addition. Such a design opens up the possibility for us to generate realistic videos of anyone doing anything anywhere. Extensive experiments on benchmark datasets and real-world cases demonstrate that our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
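For concreteness, the three conditioning pathways named in the abstract (subject tokens taking part in full attention, background concatenated along the channel dimension, motion signals added) can be sketched as tensor operations on a single transformer block. This is a minimal PyTorch sketch under assumed shapes and module names, not the paper's architecture.

```python
# Minimal sketch of the three conditioning pathways on a generic DiT block.
# All shapes, module names, and the block structure are assumptions.
import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Background latents are concatenated with the video latents along
        # the channel dimension, then projected back to the model width.
        self.in_proj = nn.Linear(2 * dim, dim)

    def forward(self, video_tokens, subject_tokens, background_tokens, motion_tokens):
        # (1) Background: channel-wise concatenation with the noisy video latents.
        x = self.in_proj(torch.cat([video_tokens, background_tokens], dim=-1))
        # (2) Motion (trajectory + action) control signal: additive fusion.
        x = x + motion_tokens
        # (3) Subject: injected as extra tokens that take part in full attention.
        seq = torch.cat([x, subject_tokens], dim=1)
        seq = seq + self.attn(self.norm(seq), self.norm(seq), self.norm(seq))[0]
        # Keep only the video positions for the next stage.
        return seq[:, : x.shape[1]]

# Toy shapes: batch 1, 256 video tokens, 32 subject tokens, model width 64.
block = ConditionedDiTBlock(dim=64)
out = block(torch.randn(1, 256, 64), torch.randn(1, 32, 64),
            torch.randn(1, 256, 64), torch.randn(1, 256, 64))
```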
Problem

Research questions and friction points this paper is trying to address.

Separately control foreground subject, background, trajectory, and action in human video generation
Decouple motion from appearance and subject from background
Enable flexible mix-and-match composition of video elements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposed motion control in 3D space with speed and orientation alignment (see the sketch after this list)
Trajectory editing via 2D-to-3D unprojection with focal-length calibration
Diffusion-based video generation with subject-token injection, background concatenation, and additive motion signals
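As referenced above, the speed alignment and orientation adjustment mentioned in the abstract can be sketched as a simple resampling of the edited 3D path plus a yaw that follows its tangent. This is a hypothetical illustration assuming a y-up world frame and arc-length resampling; the helper names are not from the paper.

```python
# Minimal sketch: align an edited 3D trajectory's per-frame speed and derive
# a facing direction along its tangent. Assumes y-up, z-forward conventions.
import numpy as np

def align_speed(traj_3d, target_speed_per_frame):
    """Resample a (N, 3) trajectory so each frame advances by target_speed_per_frame."""
    seg = np.linalg.norm(np.diff(traj_3d, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])           # arc length at each point
    n_frames = int(arc[-1] / target_speed_per_frame) + 1
    samples = np.arange(n_frames) * target_speed_per_frame
    # Linear interpolation of each coordinate against arc length.
    return np.stack([np.interp(samples, arc, traj_3d[:, k]) for k in range(3)], axis=1)

def facing_angles(traj_3d):
    """Yaw (about the vertical axis) that faces each frame along the local tangent."""
    tangent = np.gradient(traj_3d, axis=0)
    return np.arctan2(tangent[:, 0], tangent[:, 2])

# Example: a curved edited path resampled to roughly 4 cm of travel per frame.
theta = np.linspace(0, np.pi / 2, 50)
path = np.stack([2 * np.sin(theta), np.zeros_like(theta), 2 * (1 - np.cos(theta))], axis=1)
path = align_speed(path, target_speed_per_frame=0.04)
yaw = facing_angles(path)
```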