🤖 AI Summary
This work addresses the challenge of high-fidelity, temporally consistent 3D motion reconstruction of non-rigid objects—such as clothed humans—from sparse, unstructured, and partially occluded observations. We propose the first end-to-end, jointly optimized framework that integrates implicit neural fields with explicit mesh deformation. Our method introduces face-level, differential-geometry-driven near-isometric constraints to enforce spatiotemporal consistency of the deformations. It further incorporates temporal feature fusion and differentiable rendering supervised by monocular depth video, eliminating reliance on predefined parametric human models. Evaluated on human and animal motion reconstruction tasks, our approach significantly outperforms state-of-the-art methods, with substantial improvements in geometric detail preservation and temporal smoothness. Notably, it is the first method to achieve high-quality non-rigid motion reconstruction solely from monocular depth video.
📝 Abstract
We introduce a novel, data-driven approach for reconstructing temporally coherent 3D motion from unstructured and potentially partial observations of non-rigidly deforming shapes. Our goal is to achieve high-fidelity motion reconstructions for shapes that undergo near-isometric deformations, such as humans wearing loose clothing. The key novelty of our work lies in its ability to combine implicit shape representations with explicit mesh-based deformation models, enabling detailed and temporally coherent motion reconstructions without relying on parametric shape models or decoupling shape and motion. Each frame is represented as a neural field decoded from a feature space in which observations over time are fused, thus preserving geometric details present in the input data. Temporal coherence is enforced with a near-isometric deformation constraint between adjacent frames, applied to the surface underlying the neural field. Our method outperforms state-of-the-art approaches, as demonstrated by its application to human and animal motion sequences reconstructed from monocular depth videos.
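To build intuition for the near-isometric deformation constraint mentioned above: near-isometry means the deformation between adjacent frames approximately preserves intrinsic distances on the surface. A common discrete proxy for this is penalizing changes in mesh edge lengths between two frames. The sketch below is an illustrative assumption for exposition only (the paper operates on face-level differential-geometric quantities, not this simplified edge-length loss); the function name and mesh data are hypothetical.

```python
import numpy as np

def near_isometry_loss(verts_t, verts_t1, edges):
    """Penalize changes in edge lengths between adjacent frames.

    verts_t, verts_t1: (V, 3) vertex positions at frames t and t+1
    edges: (E, 2) integer vertex-index pairs of the mesh edges
    """
    # Edge lengths at each frame.
    len_t = np.linalg.norm(verts_t[edges[:, 0]] - verts_t[edges[:, 1]], axis=1)
    len_t1 = np.linalg.norm(verts_t1[edges[:, 0]] - verts_t1[edges[:, 1]], axis=1)
    # Mean squared change in edge length: zero iff the deformation
    # preserves all edge lengths (a discrete proxy for isometry).
    return np.mean((len_t1 - len_t) ** 2)

# A rigid translation preserves all edge lengths, so the loss is zero.
tri = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
edges = np.array([[0, 1], [1, 2], [2, 0]])
print(near_isometry_loss(tri, tri + np.array([0.5, 0.25, 0.0]), edges))  # → 0.0
```

A non-rigid stretch, by contrast, changes edge lengths and is penalized; in the paper's setting such a term regularizes the per-frame reconstructions toward temporally coherent, near-isometric motion.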