🤖 AI Summary
Existing monocular video-based human digital avatar reconstruction methods suffer from limited representational capacity and sparse observations, hindering faithful recovery of fine-grained dynamic details and geometric-appearance consistency across novel views. To address these challenges, we propose a high-fidelity reconstruction framework augmented with virtual-view synthesis. First, we leverage the Human4DiT video generation model to synthesize multi-view motion supervision signals, effectively compensating for missing real-world observations. Second, we introduce physics-informed identity constraints to enforce pose-shape consistency and design a block-wise denoising strategy to enhance temporal coherence and detail fidelity. Extensive evaluations on multiple benchmarks demonstrate significant improvements over state-of-the-art methods: our approach substantially suppresses reconstruction artifacts and markedly enhances visual realism and structural plausibility—particularly in texture, motion, and geometry—under novel viewpoints.
📝 Abstract
We present a novel framework to reconstruct human avatars from monocular videos. Recent approaches have struggled either to capture the fine-grained dynamic details from the input or to generate plausible details at novel viewpoints, which mainly stem from the limited representational capacity of the avatar model and insufficient observational data. To overcome these challenges, we propose to leverage the advanced video generative model, Human4DiT, to generate the human motions from alternative perspective as an additional supervision signal. This approach not only enriches the details in previously unseen regions but also effectively regularizes the avatar representation to mitigate artifacts. Furthermore, we introduce two complementary strategies to enhance video generation: To ensure consistent reproduction of human motion, we inject the physical identity into the model through video fine-tuning. For higher-resolution outputs with finer details, a patch-based denoising algorithm is employed. Experimental results demonstrate that our method outperforms recent state-of-the-art approaches and validate the effectiveness of our proposed strategies.