Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models struggle to precisely control the temporal details of complex human motions through text alone, while explicit skeleton-based control requires users to provide lengthy pose sequences, which is labor-intensive and costly. To address this, the authors propose a two-stage cascaded framework: first, an autoregressive text-to-2D-pose model generates motion sequences from textual descriptions; then, a pose-conditioned video diffusion model combines these pose sequences with a reference image to synthesize high-quality videos. The approach introduces DINO-ALF, a novel multi-level reference encoding mechanism that maintains appearance consistency under large pose variations, and constructs the first synthetic dataset of 2,000 fully controllable acrobatic motion clips. Experiments demonstrate that the proposed method significantly outperforms existing approaches in both pose generation and video synthesis on the newly curated dataset and the Motion-X Fitness benchmark.
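The two-stage cascade described above can be illustrated with a minimal sketch. This is not the paper's implementation: the joint count, embedding sizes, and the toy update rule standing in for the learned autoregressive model and the diffusion model are all hypothetical, chosen only to show the data flow (text embedding → pose sequence → pose-conditioned frames).

```python
import numpy as np

NUM_JOINTS = 17          # hypothetical COCO-style 2D skeleton
POSE_DIM = NUM_JOINTS * 2

def text_to_skeleton(text_embedding, num_frames, rng):
    """Stage 1 (mocked): autoregressively generate a 2D pose sequence.
    Each frame is conditioned on the text embedding and the previously
    generated pose; a real model would use a learned predictor here."""
    poses = [np.zeros(POSE_DIM)]
    for _ in range(num_frames - 1):
        context = poses[-1]                      # previous pose as context
        delta = 0.1 * np.tanh(text_embedding[:POSE_DIM] + context)
        poses.append(context + delta + 0.01 * rng.standard_normal(POSE_DIM))
    return np.stack(poses)                       # (num_frames, POSE_DIM)

def pose_to_video(reference_image, pose_sequence):
    """Stage 2 (mocked): one frame per pose, each frame derived from the
    reference image; a real model would run pose-conditioned diffusion."""
    frames = [reference_image + pose.mean() for pose in pose_sequence]
    return np.stack(frames)                      # (num_frames, H, W, 3)

rng = np.random.default_rng(0)
text_emb = rng.standard_normal(128)              # placeholder text encoding
ref_img = rng.random((64, 64, 3))                # placeholder reference image
skeleton = text_to_skeleton(text_emb, num_frames=16, rng=rng)
video = pose_to_video(ref_img, skeleton)
print(skeleton.shape, video.shape)               # (16, 34) (16, 64, 64, 3)
```

The key property the sketch preserves is the decoupling: the user only writes text, the intermediate skeleton sequence is generated rather than hand-authored, and the video stage never sees the text directly.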

📝 Abstract
Generating videos of complex human motions such as flips, cartwheels, and martial arts remains challenging for current video diffusion models. Text-only conditioning is temporally ambiguous for fine-grained motion control, while explicit pose-based controls, though effective, require users to provide complete skeleton sequences that are costly to produce for long and dynamic actions. We propose a two-stage cascaded framework that addresses both limitations. First, an autoregressive text-to-skeleton model generates 2D pose sequences from natural language descriptions by predicting each joint conditioned on previously generated poses. This design captures long-range temporal dependencies and inter-joint coordination required for complex motions. Second, a pose-conditioned video diffusion model synthesizes videos from a reference image and the generated skeleton sequence. It employs DINO-ALF (Adaptive Layer Fusion), a multi-level reference encoder that preserves appearance and clothing details under large pose changes and self-occlusions. To address the lack of publicly available datasets for complex human motion video generation, we introduce a Blender-based synthetic dataset containing 2,000 videos with diverse characters performing acrobatic and stunt-like motions. The dataset provides full control over appearance, motion, and environment. It fills an important gap because existing benchmarks significantly under-represent acrobatic motions while web-collected datasets raise copyright and privacy concerns. Experiments on our synthetic dataset and the Motion-X Fitness benchmark show that our text-to-skeleton model outperforms prior methods on FID, R-precision, and motion diversity. Our pose-to-video model also achieves the best results among all compared methods on VBench metrics for temporal consistency, motion smoothness, and subject preservation.
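The abstract describes DINO-ALF as an adaptive fusion of multi-level reference-encoder features. A common way to realize adaptive layer fusion is a softmax-weighted combination of per-layer feature maps; the sketch below shows that mechanism under assumed shapes (4 levels, 197 tokens, 768 dims, as in a ViT-style encoder). The function names and the uniform initial logits are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_layer_fusion(layer_features, layer_logits):
    """Fuse multi-level encoder features with learned per-layer weights,
    letting the model adaptively mix shallow texture detail (for clothing
    and appearance) with deep semantics (for identity under occlusion)."""
    weights = softmax(layer_logits)                  # (L,), sums to 1
    stacked = np.stack(layer_features)               # (L, tokens, dim)
    return np.tensordot(weights, stacked, axes=1)    # (tokens, dim)

rng = np.random.default_rng(0)
features = [rng.standard_normal((197, 768)) for _ in range(4)]  # 4 levels
logits = np.zeros(4)          # uniform fusion before any training
fused = adaptive_layer_fusion(features, logits)
print(fused.shape)            # (197, 768)
```

With zero logits the fusion reduces to a plain average of the levels; training the logits lets the encoder shift weight toward whichever layers best preserve appearance under large pose changes.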
Problem

Research questions and friction points this paper is trying to address.

complex human motion
video generation
text-to-skeleton
motion control
pose conditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

text-to-skeleton
cascaded generation
DINO-ALF
synthetic motion dataset
complex human motion