🤖 AI Summary
Monocular video-based human mesh reconstruction is often hindered by depth ambiguity and scale uncertainty, leading to metric inconsistency and temporal instability. This work proposes a depth-guided framework that couples a multi-scale fusion module with a Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator and a Motion-Depth Aligned Refinement (MoDAR) module. The fusion module blends geometric priors with RGB features through confidence-aware gating, D-MAPS exploits depth-calibrated skeletal statistics for metric-aware pose estimation, and MoDAR applies cross-modal attention between motion dynamics and geometric cues to enforce temporal coherence. Evaluated on three challenging benchmarks, the method achieves state-of-the-art performance, with significant gains in temporal consistency, spatial accuracy, and robustness under severe occlusion while maintaining computational efficiency.
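To make the fusion idea concrete, here is a minimal PyTorch sketch of confidence-aware gated fusion between depth-derived geometric features and RGB features. The module name, tensor shapes, and exact gating form are our assumptions, not the paper's released code:

```python
# Illustrative sketch only: the module name, tensor shapes, and the exact
# gating form are assumptions; the paper does not specify its implementation.
import torch
import torch.nn as nn

class DepthGuidedGatedFusion(nn.Module):
    """Blend depth-derived geometric features with RGB features using a
    learned gate modulated by a per-pixel depth-confidence map."""

    def __init__(self, channels: int):
        super().__init__()
        # The gate sees both feature streams plus the confidence map and
        # predicts, per channel and pixel, how much depth evidence to trust.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels + 1, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat, depth_conf):
        # rgb_feat, depth_feat: (B, C, H, W); depth_conf: (B, 1, H, W) in [0, 1]
        g = self.gate(torch.cat([rgb_feat, depth_feat, depth_conf], dim=1))
        g = g * depth_conf  # unreliable depth regions fall back to RGB
        return g * depth_feat + (1.0 - g) * rgb_feat

# Toy usage
fusion = DepthGuidedGatedFusion(channels=64)
rgb = torch.randn(2, 64, 56, 56)
geo = torch.randn(2, 64, 56, 56)
conf = torch.rand(2, 1, 56, 56)
fused = fusion(rgb, geo, conf)  # (2, 64, 56, 56)
```

Scaling the gate by the confidence map gives a built-in fallback: where depth is unreliable the gate is driven toward zero and the output reduces to the RGB features alone.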
📝 Abstract
Monocular video human mesh recovery faces fundamental challenges in maintaining metric consistency and temporal stability due to inherent depth ambiguities and scale uncertainties. While existing methods rely primarily on RGB features and temporal smoothing, they struggle with depth ordering, scale drift, and occlusion-induced instabilities. We propose a comprehensive depth-guided framework that achieves metric-aware temporal consistency through three synergistic components: (1) a Depth-Guided Multi-Scale Fusion module that adaptively integrates geometric priors with RGB features via confidence-aware gating; (2) a Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator that leverages depth-calibrated bone statistics for scale-consistent initialization; and (3) a Motion-Depth Aligned Refinement (MoDAR) module that enforces temporal coherence through cross-modal attention between motion dynamics and geometric cues. Our method achieves superior results on three challenging benchmarks, demonstrating significant improvements in spatial accuracy and robustness under heavy occlusion while maintaining computational efficiency.
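As a rough illustration of the MoDAR idea, the following PyTorch sketch shows cross-modal attention in which temporal motion tokens query per-frame geometric tokens. All names, dimensions, and the residual/FFN layout are assumptions about the described mechanism, not the authors' implementation:

```python
# Illustrative sketch only: an assumed form of "cross-modal attention between
# motion dynamics and geometric cues", not the authors' MoDAR code.
import torch
import torch.nn as nn

class MotionDepthCrossAttention(nn.Module):
    """Motion tokens over a clip attend to per-frame geometric tokens,
    pulling the refined motion sequence toward depth-consistent poses."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, motion_feat, depth_feat):
        # motion_feat: (B, T, D) temporal motion tokens (queries)
        # depth_feat:  (B, T, D) per-frame geometric tokens (keys/values)
        attended, _ = self.attn(motion_feat, depth_feat, depth_feat)
        x = self.norm1(motion_feat + attended)  # residual keeps motion dynamics
        return self.norm2(x + self.ffn(x))

# Toy usage over a 16-frame clip
refiner = MotionDepthCrossAttention(dim=128)
motion = torch.randn(2, 16, 128)
geo = torch.randn(2, 16, 128)
refined = refiner(motion, geo)  # (2, 16, 128)
```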