🤖 AI Summary
This work addresses the problem of real-time 3D clothed human reconstruction from monocular video. Methodologically, it introduces a novel temporal propagation mechanism that reformulates pixel-aligned reconstruction networks into a streaming video processing paradigm; designs an updateable canonical appearance representation to enforce inter-frame consistency and enable lightweight fine-tuning; and integrates time-aware feature propagation, canonical-space coordinate mapping, and joint NeRF/implicit surface modeling. The contributions are threefold: (1) significantly improved inference speed, up to 12 FPS, without per-video optimization; (2) preservation of high-fidelity geometry and texture quality; and (3) state-of-the-art performance on standard benchmarks, demonstrating strong generalization across challenging poses and diverse clothing types.
📝 Abstract
Fast 3D clothed human reconstruction from monocular video remains a significant challenge in computer vision, particularly in balancing computational efficiency with reconstruction quality. Current approaches either focus on static image reconstruction and are too computationally intensive, or achieve high quality through per-video optimization that requires minutes to hours of processing, making them unsuitable for real-time applications. To this end, we present TemPoFast3D, a novel method that leverages the temporal coherence of human appearance to reduce redundant computation while maintaining reconstruction quality. Our approach is a "plug-and-play" solution that uniquely transforms pixel-aligned reconstruction networks to handle continuous video streams by maintaining and refining a canonical appearance representation through efficient coordinate mapping. Extensive experiments demonstrate that TemPoFast3D matches or exceeds state-of-the-art methods across standard metrics while providing high-quality textured reconstruction across diverse poses and appearances, with a maximum speed of 12 FPS.
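To make the core idea of "maintaining and refining a canonical appearance representation through efficient coordinate mapping" concrete, the following is a minimal illustrative sketch: per-frame observations are scattered into a canonical (UV-like) buffer via a coordinate mapping and blended with a per-texel running weight, so the canonical appearance stays consistent across frames. All names, the grid resolution, and the running-average update rule are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sketch of a streaming canonical-appearance update.
# The paper's real pipeline is neural; here a simple weighted running
# average stands in for "refining" the canonical representation.

CANON_RES = 8  # tiny canonical grid for the sketch (illustrative)

def update_canonical(canon, weight, uv, colors, alpha=1.0):
    """Scatter per-frame colors into the canonical buffer via a
    coordinate mapping, blending with a per-texel running weight."""
    for (u, v), c in zip(uv, colors):
        w = weight[u, v]
        # Weighted running average: old evidence vs. the new observation.
        canon[u, v] = (w * canon[u, v] + alpha * c) / (w + alpha)
        weight[u, v] = w + alpha
    return canon, weight

canon = np.zeros((CANON_RES, CANON_RES, 3))
weight = np.zeros((CANON_RES, CANON_RES))

# Two "frames" observe the same canonical texel with different colors;
# the buffer converges to their average, enforcing temporal consistency.
uv = [(2, 3)]
update_canonical(canon, weight, uv, [np.array([1.0, 0.0, 0.0])])
update_canonical(canon, weight, uv, [np.array([0.0, 1.0, 0.0])])
print(canon[2, 3])
```

Because only newly observed texels are updated each frame, such a scheme avoids recomputing the full appearance per frame, which is the kind of redundancy reduction the abstract alludes to.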