AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion

📅 2026-04-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the reconstruction of globally consistent 3D human motion and human-object interactions (HOI) from in-the-wild videos shot with moving cameras, particularly for rare or complex actions. The authors propose a two-stage framework. The first stage synthesizes multi-view 2D motion data from single-view 2D keypoints so that uncommon actions are adequately represented. The second stage trains a camera-conditioned multi-view 2D diffusion model on this data to reconstruct 3D human poses and HOI in world coordinates. By combining 2D diffusion with synthetically generated multi-view data, the approach overcomes the limitations of prior methods that rely on static cameras and on priors dominated by frequently captured actions. The authors report that it outperforms prior work on challenging scenarios such as gymnastics and in-the-wild interactions, producing more realistic and globally coherent 3D reconstructions.
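
As a rough illustration of what "2D diffusion" over keypoints involves, the sketch below applies the standard DDPM forward (noising) step to a multi-view 2D keypoint tensor. It is not taken from the paper: the linear noise schedule, the tensor layout (views, frames, joints, 2), and the names `linear_beta_schedule` and `q_sample` are all assumptions made for this example, and the summary does not say which diffusion formulation the authors use.

```python
import numpy as np


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule (a common DDPM default, assumed here)."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(x0, t, alphas_cumprod, rng):
    """Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise.

    x0 is assumed to hold normalized 2D keypoints with shape
    (views, frames, joints, 2); t is an integer timestep index.
    Returns the noised sample and the noise that was added.
    """
    noise = rng.standard_normal(x0.shape)
    abar_t = alphas_cumprod[t]
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * noise
    return x_t, noise


def make_alphas_cumprod(betas):
    """Cumulative product of (1 - beta), i.e. abar_t for every timestep."""
    return np.cumprod(1.0 - betas)
```

A denoising network would then be trained to predict `noise` (or `x0`) from `x_t`, the timestep, and, in a camera-conditioned setup, some encoding of the camera parameters. How that conditioning is encoded is not described in this summary.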

📝 Abstract
Reconstructing 3D human motion and human-object interactions (HOI) from Internet videos is a fundamental step toward building large-scale datasets of human behavior. Existing methods struggle to recover globally consistent 3D motion under dynamic cameras, especially for motion types underrepresented in current motion-capture datasets, and face additional difficulty recovering coherent human-object interactions in 3D. We introduce a two-stage framework leveraging 2D diffusion that reconstructs 3D human motion and HOI from Internet videos. In the first stage, we synthesize multi-view 2D motion data for each domain, leveraging 2D keypoints extracted from Internet videos to incorporate human motions that rarely appear in existing MoCap datasets. In the second stage, a camera-conditioned multi-view 2D motion diffusion model is trained on the domain-specific synthetic data to recover 3D human motion and 3D HOI in the world space. We demonstrate the effectiveness of our method on Internet videos featuring challenging motions such as gymnastics, as well as in-the-wild HOI videos, and show that it outperforms prior work in producing realistic human motion and human-object interaction.
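
To make the link between multi-view 2D motion and world-space 3D motion concrete, here is a minimal sketch of the underlying geometry: 3D joints are projected into several pinhole cameras, then recovered by linear (DLT) triangulation. This is textbook multi-view geometry, not the authors' pipeline. The camera model, the helper names `look_at_camera`, `project`, and `triangulate`, and the focal length and image center values are assumptions chosen for illustration.

```python
import numpy as np


def look_at_camera(eye, target=None, focal=1000.0, center=(512.0, 512.0)):
    """Build a hypothetical 3x4 pinhole projection matrix P = K [R | t].

    The camera sits at `eye` and looks at `target` (world origin by default).
    `eye - target` must not be parallel to the world y axis.
    """
    target = np.zeros(3) if target is None else target
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, np.array([0.0, 1.0, 0.0]))
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)
    R = np.stack([right, down, forward])  # world -> camera rotation
    t = -R @ eye
    K = np.array([[focal, 0.0, center[0]],
                  [0.0, focal, center[1]],
                  [0.0, 0.0, 1.0]])
    return K @ np.hstack([R, t[:, None]])


def project(P, joints_3d):
    """Project (J, 3) world-space joints to (J, 2) pixel coordinates."""
    homog = np.hstack([joints_3d, np.ones((joints_3d.shape[0], 1))])
    uvw = homog @ P.T
    return uvw[:, :2] / uvw[:, 2:3]


def triangulate(Ps, views_2d):
    """Linear (DLT) triangulation of (J, 2) keypoints seen in several views."""
    num_joints = views_2d[0].shape[0]
    joints_3d = np.zeros((num_joints, 3))
    for j in range(num_joints):
        rows = []
        for P, kp in zip(Ps, views_2d):
            u, v = kp[j]
            rows.append(u * P[2] - P[0])  # each view adds two linear constraints
            rows.append(v * P[2] - P[1])
        _, _, vt = np.linalg.svd(np.stack(rows))
        X = vt[-1]  # null-space direction = homogeneous 3D point
        joints_3d[j] = X[:3] / X[3]
    return joints_3d
```

With noise-free, mutually consistent 2D keypoints and known cameras, `triangulate` recovers the original joints up to numerical precision. Real detections are noisy and cameras must be estimated, which is part of why a learned model is attractive for this step.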
Problem

Research questions and friction points this paper is trying to address.

3D human motion reconstruction
human-object interaction
Internet videos
dynamic cameras
motion capture
Innovation

Methods, ideas, or system contributions that make the work stand out.

2D diffusion
3D motion reconstruction
human-object interaction
multi-view synthesis
Internet videos