Lifting Motion to the 3D World via 2D Diffusion

📅 2024-11-27
🏛️ arXiv.org
📈 Citations: 2
Influential citations: 1
🤖 AI Summary
This work addresses the challenging problem of reconstructing globally consistent 3D human motion—including joint rotations and root trajectory—in world coordinates from monocular 2D pose sequences, without any 3D ground-truth supervision. We propose MVLift, a novel framework that introduces a multi-view consistency generation paradigm grounded in a 2D motion diffusion model. Our method integrates multi-stage view-consistency modeling, unsupervised 3D geometric constraint optimization, and cross-domain pose representation learning. To our knowledge, MVLift is the first approach to achieve full 3D motion estimation in world coordinates under purely 2D supervision. Extensive experiments on five benchmark datasets demonstrate state-of-the-art performance—surpassing even methods relying on 3D supervision—while significantly improving generalization to complex human motions, human-object interactions, animal locomotion, and out-of-distribution scenarios. MVLift effectively alleviates the long-standing bottlenecks of 3D annotation dependency and limited cross-domain generalizability.
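
As rough orientation only, the minimal Python/NumPy sketch below shows the structure of DDPM-style sampling of a single 2D pose sequence from a motion diffusion model. The placeholder toy_denoiser, the 60-frame/18-joint shape, and the linear noise schedule are illustrative assumptions, not MVLift's actual network or training setup.

import numpy as np

# Toy DDPM-style reverse sampling of a 2D pose sequence (illustrative only).
T_FRAMES, N_JOINTS, STEPS = 60, 18, 50            # assumed sequence length, joints, diffusion steps
betas = np.linspace(1e-4, 0.02, STEPS)            # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_denoiser(x_t, t):
    # Stand-in for a learned network that predicts the noise present in x_t at step t.
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x = rng.standard_normal((T_FRAMES, N_JOINTS, 2))  # start from Gaussian noise over the whole sequence
for t in reversed(range(STEPS)):
    eps_hat = toy_denoiser(x, t)
    mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * noise
print(x.shape)                                    # (60, 18, 2): one sampled 2D pose sequence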

📝 Abstract
Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, limiting their applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion -- including both joint rotations and root trajectories in the world coordinate system -- using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including those methods that require 3D supervision.
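
The abstract identifies view-consistent 2D pose sequences as the key step toward global 3D motion. The sketch below is a minimal illustration of why multi-view consistency matters: once several views agree, per-frame 3D joints can be recovered by linear (DLT) triangulation. The camera rig here (look_at_projection, four synthetic views) is an assumption for the demo; MVLift's actual recovery of joint rotations and root trajectory is considerably more involved.

import numpy as np

def triangulate_point(points_2d, proj_mats):
    # Linear (DLT) triangulation of one 3D point from its 2D projections.
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]
    return X[:3] / X[3]                           # dehomogenize

def look_at_projection(cam_pos):
    # Hypothetical pinhole camera at cam_pos looking at the origin (not the paper's setup).
    z = -cam_pos / np.linalg.norm(cam_pos)
    x = np.cross([0.0, 1.0, 0.0], z); x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                       # world-to-camera rotation
    K = np.diag([1000.0, 1000.0, 1.0])            # assumed intrinsics
    return K @ np.hstack([R, (-R @ cam_pos)[:, None]])

# Four synthetic views placed on a circle around the subject.
cams = [look_at_projection(3.0 * np.array([np.cos(a), 0.2, np.sin(a)]))
        for a in np.linspace(0.0, 2.0 * np.pi, 4, endpoint=False)]

# Project one synthetic 3D joint into every view, then recover it from the 2D points alone.
joint_world = np.array([0.1, 0.5, -0.2])
obs = []
for P in cams:
    uvw = P @ np.append(joint_world, 1.0)
    obs.append(uvw[:2] / uvw[2])
print(triangulate_point(obs, cams))               # ~ [0.1, 0.5, -0.2]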
Problem

Research questions and friction points this paper is trying to address.

Estimating 3D motion from 2D pose sequences
Overcoming dependency on 3D ground truth data
Generalizing to diverse domains without 3D supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 2D diffusion models for 3D motion estimation
Multi-stage framework for consistent 2D pose generation
No 3D supervision required, generalizes across domains (see the sketch after this list)
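
As referenced in the last bullet, here is a toy illustration of fitting 3D motion using only 2D observations: a 3D trajectory is optimized by gradient descent on a multi-view reprojection loss plus a temporal smoothness term, with no 3D labels anywhere. The orthographic cameras, the synthetic hidden trajectory, and the loss weights are assumptions for the demo; they are not MVLift's objective or optimizer.

import numpy as np

rng = np.random.default_rng(1)
T, K = 30, 3                                      # frames and (assumed) number of views

def random_ortho_cam():
    # Toy orthographic camera: two orthonormal rows of a random rotation.
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return Q[:2]                                  # 2x3 projection, x_2d = A @ x_3d

cams = [random_ortho_cam() for _ in range(K)]

# Stand-in for view-consistent 2D pose sequences: simulated from a hidden 3D trajectory.
hidden = np.cumsum(0.05 * rng.standard_normal((T, 3)), axis=0)
obs = [hidden @ A.T for A in cams]                # one (T, 2) sequence per view

# Fit a 3D trajectory from 2D observations only: reprojection loss + temporal smoothness.
X = np.zeros((T, 3))
lr, lam = 0.1, 0.1
for _ in range(2000):
    grad = np.zeros_like(X)
    for A, x2d in zip(cams, obs):
        grad += (X @ A.T - x2d) @ A               # gradient of 0.5 * ||X A^T - x2d||^2
    vel = np.diff(X, axis=0)                      # frame-to-frame motion, penalized for smoothness
    grad[:-1] -= lam * vel
    grad[1:] += lam * vel
    X -= lr * grad
print(np.abs(X - hidden).max())                   # small: 3D recovered without any 3D labels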
Jiaman Li
Amazon FAR
Computer Vision, Computer Graphics, Robotics
C. K. Liu
Stanford University
Jiajun Wu
Stanford University