🤖 AI Summary
Monocular video-based human mesh reconstruction is often hindered by depth ambiguity and scale uncertainty, leading to metric inconsistency and temporal instability. This work proposes a depth-guided framework that couples a multi-scale fusion module with a Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator and a Motion-Depth Aligned Refinement (MoDAR) module. The fusion module blends geometric priors with RGB features through confidence-aware gating, D-MAPS exploits depth-calibrated skeletal statistics for metric-aware pose estimation, and MoDAR applies cross-modal attention between motion dynamics and geometric cues to enforce temporal coherence. Evaluated on three challenging benchmarks, the method achieves state-of-the-art performance, with significant gains in temporal consistency, spatial accuracy, and robustness under severe occlusion while maintaining computational efficiency.
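To make the fusion idea concrete, here is a minimal PyTorch sketch of confidence-aware gated fusion between depth-derived geometric features and RGB features. The module name, tensor shapes, and exact gating form are our assumptions, not the paper's released code:

```python
# Illustrative sketch only: the module name, tensor shapes, and the exact
# gating form are assumptions; the paper does not specify its implementation.
import torch
import torch.nn as nn

class DepthGuidedGatedFusion(nn.Module):
    """Blend depth-derived geometric features with RGB features using a
    learned gate modulated by a per-pixel depth-confidence map."""

    def __init__(self, channels: int):
        super().__init__()
        # The gate sees both feature streams plus the confidence map and
        # predicts, per channel and pixel, how much depth evidence to trust.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels + 1, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat, depth_conf):
        # rgb_feat, depth_feat: (B, C, H, W); depth_conf: (B, 1, H, W) in [0, 1]
        g = self.gate(torch.cat([rgb_feat, depth_feat, depth_conf], dim=1))
        g = g * depth_conf  # unreliable depth regions fall back to RGB
        return g * depth_feat + (1.0 - g) * rgb_feat

# Toy usage
fusion = DepthGuidedGatedFusion(channels=64)
rgb = torch.randn(2, 64, 56, 56)
geo = torch.randn(2, 64, 56, 56)
conf = torch.rand(2, 1, 56, 56)
fused = fusion(rgb, geo, conf)  # (2, 64, 56, 56)
```

Scaling the gate by the confidence map gives a built-in fallback: where depth is unreliable the gate is driven toward zero and the output reduces to the RGB features alone.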
📝 Abstract
Monocular video human mesh recovery faces fundamental challenges in maintaining metric consistency and temporal stability due to inherent depth ambiguities and scale uncertainties. While existing methods rely primarily on RGB features and temporal smoothing, they struggle with depth ordering, scale drift, and occlusion-induced instabilities. We propose a comprehensive depth-guided framework that achieves metric-aware temporal consistency through three synergistic components: (1) a Depth-Guided Multi-Scale Fusion module that adaptively integrates geometric priors with RGB features via confidence-aware gating; (2) a Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator that leverages depth-calibrated bone statistics for scale-consistent initialization; and (3) a Motion-Depth Aligned Refinement (MoDAR) module that enforces temporal coherence through cross-modal attention between motion dynamics and geometric cues. Our method achieves superior results on three challenging benchmarks, demonstrating significant improvements in spatial accuracy and robustness under heavy occlusion while maintaining computational efficiency.
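As a rough illustration of the MoDAR idea, the following PyTorch sketch shows cross-modal attention in which temporal motion tokens query per-frame geometric tokens. All names, dimensions, and the residual/FFN layout are assumptions about the described mechanism, not the authors' implementation:

```python
# Illustrative sketch only: an assumed form of "cross-modal attention between
# motion dynamics and geometric cues", not the authors' MoDAR code.
import torch
import torch.nn as nn

class MotionDepthCrossAttention(nn.Module):
    """Motion tokens over a clip attend to per-frame geometric tokens,
    pulling the refined motion sequence toward depth-consistent poses."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, motion_feat, depth_feat):
        # motion_feat: (B, T, D) temporal motion tokens (queries)
        # depth_feat:  (B, T, D) per-frame geometric tokens (keys/values)
        attended, _ = self.attn(motion_feat, depth_feat, depth_feat)
        x = self.norm1(motion_feat + attended)  # residual keeps motion dynamics
        return self.norm2(x + self.ffn(x))

# Toy usage over a 16-frame clip
refiner = MotionDepthCrossAttention(dim=128)
motion = torch.randn(2, 16, 128)
geo = torch.randn(2, 16, 128)
refined = refiner(motion, geo)  # (2, 16, 128)
```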