MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

📅 2025-12-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing motion capture methods rely on predefined skeletal templates, limiting generalization to arbitrary bone structures. This work formalizes Category-Agnostic Motion Capture (CAMoCap), the task of reconstructing rotation-based animation (e.g., BVH) for arbitrary rigged 3D assets from monocular video, without species- or skeleton-specific priors, and introduces MoCapAnything, the first framework to address it. The approach employs a reference-guided, decoupled architecture: a reference prompt encoder and a video feature extractor jointly model cross-asset semantics; a unified motion decoder produces temporally coherent joint trajectories; and a constraint-aware, lightweight IK solver recovers asset-specific joint rotations and performs cross-skeleton retargeting. Trained on Truebones Zoo (1,038 skeleton-mesh-render triads), MoCapAnything achieves high-fidelity animation on both in-domain benchmarks and in-the-wild videos, significantly improves generalization for heterogeneous skeleton retargeting, and enables scalable, prompt-driven universal motion capture.

๐Ÿ“ Abstract
Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/
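The abstract describes a factorized pipeline: per-joint queries from the asset prompt, dense video descriptors, a decoder that fuses the two into 3D joint trajectories, and an IK stage that turns trajectories into rotations. The dataflow can be sketched as below; every shape, module body, and function name here is an illustrative assumption standing in for the paper's learned networks, not the actual implementation.

```python
# Toy dataflow sketch of the factorized CAMoCap pipeline (shapes and
# internals are assumptions; the real modules are learned networks).
import numpy as np

def reference_prompt_encoder(num_joints, dim=64):
    """Stand-in: per-joint queries from the asset's skeleton, mesh, renders."""
    return np.random.default_rng(0).normal(size=(num_joints, dim))

def video_feature_extractor(num_frames, dim=64):
    """Stand-in: dense per-frame visual descriptors from the monocular video."""
    return np.random.default_rng(1).normal(size=(num_frames, dim))

def unified_motion_decoder(joint_queries, video_feats):
    """Toy fusion: a shared per-frame motion code projected from the video
    features, plus a per-joint offset read from the prompt queries, giving
    one xyz position per joint per frame."""
    head = np.random.default_rng(2).normal(size=(video_feats.shape[1], 3)) * 0.1
    base = video_feats @ head                       # (T, 3) shared motion code
    offsets = joint_queries[:, :3]                  # (J, 3) per-joint offsets
    return base[:, None, :] + offsets[None, :, :]   # (T, J, 3) trajectories

queries = reference_prompt_encoder(num_joints=24)   # arbitrary rig: 24 joints
feats = video_feature_extractor(num_frames=16)      # 16-frame clip
trajectories = unified_motion_decoder(queries, feats)
print(trajectories.shape)                           # (16, 24, 3)
```

The point of the factorization is that trajectories live in a skeleton-agnostic 3D space; only the downstream IK stage needs the specific asset's bone hierarchy and joint limits.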
Problem

Research questions and friction points this paper is trying to address.

How to capture motion for arbitrary 3D skeletons from monocular videos
How to reconstruct rotation-based animations that directly drive a specific asset
How to retarget motion across heterogeneous, cross-species rigs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts 3D joint trajectories from monocular video input
Uses reference-guided modules to encode arbitrary skeleton prompts
Recovers asset-specific rotations via constraint-aware inverse kinematics
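The last step, recovering rotations from predicted joint positions under joint constraints, can be illustrated with a single-bone case: find the rotation taking a bone's rest direction onto its predicted direction, with the angle clamped to a per-joint limit. This is a generic hedged sketch of constraint-aware IK (Rodrigues' formula plus angle clamping), not the paper's solver.

```python
# Minimal constraint-aware IK step for one bone (illustrative only):
# rotate the rest-pose bone direction onto the predicted direction,
# clamping the rotation angle to a joint limit.
import numpy as np

def bone_rotation(rest_dir, target_dir, max_angle=np.pi / 2):
    a = rest_dir / np.linalg.norm(rest_dir)
    b = target_dir / np.linalg.norm(target_dir)
    axis = np.cross(a, b)
    s = np.linalg.norm(axis)
    if s < 1e-8:                       # directions (anti-)parallel: no-op
        return np.eye(3)
    angle = np.arctan2(s, np.dot(a, b))
    angle = min(angle, max_angle)      # joint-limit constraint
    axis = axis / s
    # Rodrigues' rotation formula via the skew-symmetric cross matrix.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

# A bone resting along +Y whose child joint is predicted along +X:
R = bone_rotation(np.array([0.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0]))
print(np.round(R @ np.array([0.0, 1.0, 0.0]), 6))   # -> [1. 0. 0.]
```

Applying this per bone down the hierarchy (and exporting the angles) yields exactly the kind of rotation-based animation, e.g. BVH, that the paper targets; when the clamp is hit, the bone stops at its limit instead of matching the prediction, which is the constraint-aware behavior.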
Authors
Kehong Gong, National University of Singapore (digital human, deep learning)
Zhengyu Wen, Huawei Central Media Technology Institute
Weixia He, Huawei Central Media Technology Institute
Min Xu, Huawei Central Media Technology Institute
Qi Wang, Huawei Central Media Technology Institute
Ning Zhang, Huawei Central Media Technology Institute
Zhengyu Li, Peking University (Quantum Cryptography)
Dongze Lian, Huawei Central Media Technology Institute
Wei Zhao, Huawei Central Media Technology Institute
Xiaoyu He, Huawei Central Media Technology Institute
Mingyuan Zhang, Huawei Central Media Technology Institute