GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular video-based 3D human geometry estimation suffers from temporal inconsistency, loss of fine-grained dynamic detail, scarcity of high-quality 4D annotations, and the difficulty of modeling metric depth. To address these challenges, we propose the first image-to-video diffusion paradigm tailored to geometry estimation. Our approach introduces a root-relative depth representation that remains feasible to estimate from monocular input while retaining human-scale information. We design a conditional generative architecture that combines image depth priors with video diffusion, incorporating root-relative depth encoding and spatiotemporal consistency regularization. Extensive experiments demonstrate state-of-the-art performance across multiple benchmarks, with improvements in temporal smoothness, fine-grained motion modeling, and cross-scene generalization.

📝 Abstract
Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimates from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video; this estimate then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model's role to focusing on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the limitations of traditional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from videos.
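The abstract does not spell out how root-relative depth is computed, so the following is only a minimal sketch under stated assumptions: that a single "root" depth per frame is available (for example the depth of a pelvis joint) and that the representation is metric depth minus that root depth, restricted to human pixels. The function names `to_root_relative` and `to_metric`, the mask convention, and the toy values are all hypothetical and are not taken from the paper.

```python
from typing import List, Optional

DepthMap = List[List[float]]                # metric depth per pixel, in meters
RelativeMap = List[List[Optional[float]]]   # None marks background pixels


def to_root_relative(metric_depth: DepthMap,
                     human_mask: List[List[bool]],
                     root_depth: float) -> RelativeMap:
    """Express human pixels as offsets from an assumed root depth.

    Offsets stay in meters, so body proportions are kept, but the absolute
    camera-to-person distance is factored out.
    """
    return [
        [d - root_depth if inside else None
         for d, inside in zip(depth_row, mask_row)]
        for depth_row, mask_row in zip(metric_depth, human_mask)
    ]


def to_metric(root_relative: RelativeMap, root_depth: float) -> RelativeMap:
    """Recover metric depth once a root depth is known."""
    return [
        [None if r is None else r + root_depth for r in row]
        for row in root_relative
    ]


# Toy 2x2 depth map; the bottom-right pixel is background.
depth = [[3.0, 3.2], [3.1, 9.0]]
mask = [[True, True], [True, False]]
rel = to_root_relative(depth, mask, root_depth=3.1)

# Moving the person 2 m farther away changes metric depth everywhere,
# but the root-relative values are unchanged up to floating-point error.
shifted = [[d + 2.0 for d in row] for row in depth]
rel_shifted = to_root_relative(shifted, mask, root_depth=5.1)
```

Under these assumptions, the representation avoids having to predict absolute distance from a single view (the hard part of metric depth) while, unlike an affine-invariant depth map, it keeps offsets in real units.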
Problem

Research questions and friction points this paper is trying to address.

Estimating temporally consistent 3D human geometry from videos
Addressing scarcity of high-quality 4D training data
Estimating human size accurately, which requires depth at metric scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

Image-to-video diffusion for temporal consistency
Root-relative depth representation for accurate sizing
Image-based first-frame estimate conditioning a video diffusion model, reducing the need for 4D training data
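The division of labor described above (an image model handles the first frame, a video model propagates that estimate) can be sketched as a simple interface. This is an illustration only: `estimate_video_geometry`, the type aliases, and both toy models are hypothetical stand-ins, not GeoMan's actual components, and the toy video model merely copies the first-frame estimate rather than running any diffusion process.

```python
from typing import Callable, List, Sequence

Frame = List[List[float]]      # hypothetical single-channel image
Geometry = List[List[float]]   # hypothetical per-pixel depth map

ImageModel = Callable[[Frame], Geometry]
VideoModel = Callable[[Sequence[Frame], Geometry], List[Geometry]]


def estimate_video_geometry(frames: Sequence[Frame],
                            image_model: ImageModel,
                            video_model: VideoModel) -> List[Geometry]:
    """Estimate first-frame geometry, then let the video model extend it."""
    if not frames:
        raise ValueError("need at least one frame")
    first_geometry = image_model(frames[0])
    # The video model sees all frames plus the first-frame geometry as its
    # condition, mirroring an image-to-video generation setup.
    return video_model(frames, first_geometry)


def toy_image_model(frame: Frame) -> Geometry:
    """Stand-in image model: a fixed offset on pixel intensity."""
    return [[1.0 + px for px in row] for row in frame]


def toy_video_model(frames: Sequence[Frame],
                    first_geometry: Geometry) -> List[Geometry]:
    """Stand-in video model: repeats the conditioning geometry per frame."""
    return [[row[:] for row in first_geometry] for _ in frames]


frames = [[[0.0, 0.5]], [[0.1, 0.6]], [[0.2, 0.7]]]
geometry = estimate_video_geometry(frames, toy_image_model, toy_video_model)
```

Keeping the two models behind separate callables reflects the idea that the image model can be trained on plentiful single-image data, while the video model only needs to learn how to carry that estimate through time.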