Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of learning physically plausible 3D character controllers directly from monocular 2D video keypoints, without requiring scarce 3D motion-capture data or pre-trained motion reconstruction models. Methodologically, it introduces Mimic2DM, a hierarchical control framework: an upper-level Transformer autoregressively generates diverse 2D reference motions, while a lower-level tracking policy implicitly models 3D motion via multi-view aggregation and is trained end-to-end by jointly optimizing a reprojection loss and physics-based simulation rewards. Its key contribution is the first demonstration that a physically consistent and generalizable 3D control policy can be trained using *only* 2D keypoint supervision. Experiments on complex human-object interaction (HOI) and non-human character tasks, including dance, soccer dribbling, and animal locomotion, validate superior motion diversity, visual realism, and physical plausibility. Mimic2DM thereby improves on the practicality and robustness of existing 2D-driven 3D control approaches.
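A minimal sketch of the hierarchical rollout described above, assuming hypothetical `generator`, `policy`, and `sim` interfaces (none of these names come from the paper): the upper level autoregressively proposes the next 2D reference pose, and the lower-level tracking policy acts in the physics simulator to follow it.

```python
# Hypothetical interfaces; a sketch of the two-level control loop,
# not the paper's implementation.
def hierarchical_rollout(generator, policy, sim, horizon: int):
    history_2d = [sim.observe_keypoints_2d()]        # seed with the current 2D pose
    for _ in range(horizon):
        ref_2d = generator.next_pose(history_2d)     # upper level: next 2D target frame
        obs = sim.observe_state()                    # proprioceptive simulator state
        action = policy.act(obs, ref_2d)             # lower level: joint actuation
        sim.step(action)                             # advance the physics simulation
        history_2d.append(ref_2d)                    # autoregressive conditioning
    return history_2d
```

Note that the two levels exchange only 2D keypoints, so no explicit 3D reference motion ever enters the loop.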

📝 Abstract
Video data is more cost-effective than motion capture data for learning 3D character motion controllers, yet synthesizing realistic and diverse behaviors directly from videos remains challenging. Previous approaches typically rely on off-the-shelf motion reconstruction techniques to obtain 3D trajectories for physics-based imitation. These reconstruction methods struggle with generalizability, as they either require 3D training data (potentially scarce) or fail to produce physically plausible poses, hindering their application to challenging scenarios like human-object interaction (HOI) or non-human characters. We tackle this challenge by introducing Mimic2DM, a novel motion imitation framework that learns the control policy directly and solely from widely available 2D keypoint trajectories extracted from videos. By minimizing the reprojection error, we train a general single-view 2D motion tracking policy capable of following arbitrary 2D reference motions in physics simulation, using only 2D motion data. The policy, when trained on diverse 2D motions captured from different or slightly different viewpoints, can further acquire 3D motion tracking capabilities by aggregating multiple views. Moreover, we develop a transformer-based autoregressive 2D motion generator and integrate it into a hierarchical control framework, where the generator produces high-quality 2D reference trajectories to guide the tracking policy. We show that the proposed approach is versatile and can effectively learn to synthesize physically plausible and diverse motions across a range of domains, including dancing, soccer dribbling, and animal movements, without any reliance on explicit 3D motion data. Project Website: https://jiann-li.github.io/mimic2dm/
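Per the abstract, the tracking policy is supervised by the reprojection error between the simulated character's joints and the 2D reference keypoints, and aggregating that error over multiple views is what yields 3D tracking. A minimal sketch under a standard pinhole-camera assumption; the function names and camera parameterization are illustrative, not the paper's code.

```python
import numpy as np

def project(joints_3d: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project (J, 3) world-space joints to (J, 2) pixels with a pinhole
    camera: intrinsics K (3x3), rotation R (3x3), translation t (3,)."""
    cam = joints_3d @ R.T + t           # world frame -> camera frame
    uvw = cam @ K.T                     # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide

def reprojection_error(sim_joints_3d, ref_keypoints_2d, K, R, t) -> float:
    """Mean per-joint pixel distance between the simulated character's
    projected joints and the 2D reference keypoints (single view)."""
    pred_2d = project(sim_joints_3d, K, R, t)
    return float(np.linalg.norm(pred_2d - ref_keypoints_2d, axis=-1).mean())

def multiview_error(sim_joints_3d, refs_2d, cameras) -> float:
    """Average the single-view error over several (K, R, t) views;
    multiple views jointly constrain the motion in 3D."""
    errs = [reprojection_error(sim_joints_3d, ref, *cam)
            for ref, cam in zip(refs_2d, cameras)]
    return float(np.mean(errs))
```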
Problem

Research questions and friction points this paper is trying to address.

Learning 3D character control from 2D video data
Overcoming limitations of 3D motion reconstruction methods
Generating diverse, physically plausible motions without 3D data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct 2D keypoint control from video data
Transformer-based autoregressive 2D motion generation (see the sketch after this list)
Hierarchical framework for 3D tracking without 3D data
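As referenced in the list above, a transformer-based autoregressive 2D motion generator can be kept quite small. In this sketch (PyTorch), the flattened keypoint encoding, learned positional embedding, layer sizes, and next-frame regression head are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Motion2DGenerator(nn.Module):
    """Causal transformer over flattened (J, 2) keypoint frames."""
    def __init__(self, num_joints: int = 24, d_model: int = 256,
                 nhead: int = 4, num_layers: int = 4, max_len: int = 512):
        super().__init__()
        self.embed = nn.Linear(num_joints * 2, d_model)           # flatten (J, 2) per frame
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model)) # learned positional embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_joints * 2)            # regress the next frame

    def forward(self, past: torch.Tensor) -> torch.Tensor:
        """past: (B, T, J*2). Returns (B, T, J*2), where position t is the
        prediction for frame t+1 (the causal mask keeps it autoregressive)."""
        T = past.shape[1]
        x = self.embed(past) + self.pos[:, :T]
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(past.device)
        return self.head(self.encoder(x, mask=mask))

    @torch.no_grad()
    def next_pose(self, past: torch.Tensor) -> torch.Tensor:
        """Rollout step: (B, T, J*2) -> (B, J*2), the predicted next frame."""
        return self.forward(past)[:, -1]
```

Rollout repeatedly appends `next_pose(past)` to the history, mirroring the role the generator plays when feeding 2D reference trajectories to the tracking policy.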
Jianan Li
The Chinese University of Hong Kong
Xiao Chen
The Chinese University of Hong Kong
Tao Huang
Shanghai AI Laboratory, Shanghai Jiao Tong University
Tien-Tsin Wong
Professor, Dept of Data Science and Artificial Intelligence, Monash University
Generative AI · Computer Graphics · Computational Manga · Computer Vision