Joint Optimization for 4D Human-Scene Reconstruction in the Wild

📅 2025-01-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the joint reconstruction of dynamic human motion and 3D scene geometry from in-the-wild monocular web videos, where ground-truth annotations are unavailable. The authors propose JOSH, an optimization-based method for 4D (3D + time) human-scene reconstruction. JOSH initializes from dense scene reconstruction and human mesh recovery, then uses human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. In the reported experiments, JOSH improves both global human motion estimation and dense scene reconstruction. The authors also introduce JOSH3R, a more efficient, optimization-free model trained directly on pseudo-labels that JOSH predicts from web videos, so no manual annotation is needed. JOSH3R outperforms other optimization-free methods, which supports the accuracy of JOSH's pseudo-labels and the generalization ability of the approach.

📝 Abstract

Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, those prior methods can hardly reconstruct the natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques from both dense scene reconstruction and human mesh recovery as initialization, and then leverages human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experimental results show that JOSH achieves better results on both global human motion estimation and dense scene reconstruction through joint optimization of scene geometry and human motion. We further design a more efficient model, JOSH3R, and directly train it with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods while training only on labels predicted by JOSH, further demonstrating its accuracy and generalization ability.
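
To make the idea of contact-constrained joint optimization concrete, here is a minimal toy sketch in Python. It is not the JOSH implementation. It assumes a heavily simplified setting in which a monocular scene reconstruction is known only up to an unknown scale, the human mesh recovery gives metric contact points (for example, foot joints) with an unknown global offset, and each contact point has already been matched to a scene point. The function names, the quadratic contact loss, the offset prior weight `lam`, and the alternating closed-form updates are all illustrative assumptions.

```python
# Toy sketch only: every name here is hypothetical and not taken from JOSH.
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def contact_loss(scale: float, offset: Sequence[float],
                 scene_pts: List[Sequence[float]],
                 human_pts: List[Sequence[float]],
                 lam: float = 1.0) -> float:
    """Squared distance between scaled scene points and shifted human
    contact points, plus a prior keeping the offset near its initial value."""
    loss = 0.0
    for q, h in zip(scene_pts, human_pts):
        r = [scale * q[k] - (h[k] + offset[k]) for k in range(3)]
        loss += dot(r, r)
    return loss + lam * dot(offset, offset)


def fit_scale_and_offset(scene_pts: List[Sequence[float]],
                         human_pts: List[Sequence[float]],
                         lam: float = 1.0,
                         iters: int = 50) -> Tuple[float, List[float]]:
    """Jointly fit the scene scale and a global human offset by alternating
    closed-form least-squares updates (block coordinate descent)."""
    n = len(scene_pts)
    qq = sum(dot(q, q) for q in scene_pts)
    scale, offset = 1.0, [0.0, 0.0, 0.0]
    for _ in range(iters):
        # Update the scene scale with the human offset held fixed.
        scale = sum(
            dot(q, [h[k] + offset[k] for k in range(3)])
            for q, h in zip(scene_pts, human_pts)
        ) / qq
        # Update the human offset with the scene scale held fixed.
        offset = [
            sum(scale * q[k] - h[k] for q, h in zip(scene_pts, human_pts))
            / (n + lam)
            for k in range(3)
        ]
    return scale, offset


# Synthetic example: the scene is reconstructed at half its metric size.
human_contacts = [(0.1, 1.0, 3.0), (-0.1, 1.0, 3.2), (0.3, 1.05, 4.0)]
scene_contacts = [tuple(0.5 * c for c in p) for p in human_contacts]
est_scale, est_offset = fit_scale_and_offset(scene_contacts, human_contacts)
```

In this synthetic example the contact term alone is enough to recover the missing scene scale, which illustrates why tying human motion to scene geometry can help both estimates. The method described in the abstract additionally optimizes camera poses and full human motion over time, which this sketch does not attempt.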

Problem

Research questions and friction points this paper is trying to address.

4D Reconstruction

Human-Environment Interaction

Action Prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

4D Reconstruction

Global Motion Prediction

Pseudo-label Training

🔎 Similar Papers

Uncertainty-Aware Testing-Time Optimization for 3D Human Pose Estimation

2024-02-04, arXiv.org, Citations: 0
