DevilSight: Augmenting Monocular Human Avatar Reconstruction through a Virtual Perspective

📅 2025-08-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing monocular video-based human digital avatar reconstruction methods suffer from limited representational capacity and sparse observations, which hinder faithful recovery of fine-grained dynamic detail and geometry-appearance consistency across novel views. To address these challenges, we propose a high-fidelity reconstruction framework augmented with virtual-view synthesis. First, we leverage the Human4DiT video generation model to synthesize multi-view motion sequences as additional supervision, compensating for missing real-world observations. Second, we inject the subject's physical identity into the generator through video fine-tuning to enforce pose-shape consistency, and we design a block-wise denoising strategy to improve temporal coherence and detail fidelity. Extensive evaluations on multiple benchmarks demonstrate clear improvements over state-of-the-art methods: our approach suppresses reconstruction artifacts and enhances visual realism and structural plausibility under novel viewpoints, particularly in texture, motion, and geometry.
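
To make the virtual-view supervision concrete, here is a minimal sketch of one training step that mixes a real monocular frame with a generator-synthesized frame from another viewpoint. The interface (`avatar.render`, the batch keys, the weight `w_virtual`) is a hypothetical stand-in, not the paper's published API.

```python
# Minimal sketch of virtual-view supervision for avatar optimization.
# `avatar.render`, the batch layout, and `w_virtual` are hypothetical.
import torch.nn.functional as F

def training_step(avatar, optimizer, real_batch, virtual_batch, w_virtual=0.5):
    """One step mixing real and generated supervision.

    real_batch:    pose, camera, image observed in the monocular video.
    virtual_batch: pose, camera, image synthesized by the video generator
                   (e.g. Human4DiT) from an alternative viewpoint.
    """
    optimizer.zero_grad()

    # Photometric loss against the real monocular frame.
    pred_real = avatar.render(real_batch["pose"], real_batch["camera"])
    loss_real = F.l1_loss(pred_real, real_batch["image"])

    # Down-weighted loss against the generated virtual view, which
    # supervises regions the real camera never observes.
    pred_virt = avatar.render(virtual_batch["pose"], virtual_batch["camera"])
    loss_virt = F.l1_loss(pred_virt, virtual_batch["image"])

    loss = loss_real + w_virtual * loss_virt
    loss.backward()
    optimizer.step()
    return loss.item()
```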

📝 Abstract
We present a novel framework for reconstructing human avatars from monocular videos. Recent approaches struggle either to capture fine-grained dynamic details from the input or to generate plausible details at novel viewpoints, which stems mainly from the limited representational capacity of the avatar model and insufficient observational data. To overcome these challenges, we propose to leverage an advanced video generative model, Human4DiT, to generate human motion from alternative perspectives as an additional supervision signal. This approach not only enriches the details in previously unseen regions but also effectively regularizes the avatar representation to mitigate artifacts. Furthermore, we introduce two complementary strategies to enhance video generation: to ensure consistent reproduction of human motion, we inject the subject's physical identity into the model through video fine-tuning; for higher-resolution outputs with finer details, a patch-based denoising algorithm is employed. Experimental results demonstrate that our method outperforms recent state-of-the-art approaches and validate the effectiveness of our proposed strategies.
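
A minimal sketch of one patch-based (tiled) denoising pass, under the common formulation where the latent is split into overlapping tiles, each tile is denoised independently, and overlapping predictions are averaged back together. `denoise_fn`, the tile size, and the overlap are illustrative placeholders, not Human4DiT's actual interface.

```python
# Patch-based denoising sketch: tile, denoise, and blend by averaging.
import torch

def _positions(size, tile, stride):
    # Tile start offsets along one axis; the final tile is edge-aligned
    # so the whole latent is covered (assumes size >= tile).
    pos = list(range(0, size - tile + 1, stride))
    if pos[-1] != size - tile:
        pos.append(size - tile)
    return pos

def patch_denoise_step(latent, denoise_fn, tile=64, overlap=16):
    """latent: (C, H, W) tensor; denoise_fn maps a (C, tile, tile) patch
    to its denoised prediction. Overlapping predictions are averaged."""
    _, H, W = latent.shape
    out = torch.zeros_like(latent)
    weight = torch.zeros(1, H, W, device=latent.device)
    stride = tile - overlap
    for y in _positions(H, tile, stride):
        for x in _positions(W, tile, stride):
            patch = latent[:, y:y + tile, x:x + tile]
            out[:, y:y + tile, x:x + tile] += denoise_fn(patch)
            weight[:, y:y + tile, x:x + tile] += 1.0
    return out / weight
```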
Problem

Research questions and friction points this paper is trying to address.

Reconstructing human avatars from monocular videos with fine details
Generating plausible details at novel viewpoints for avatars
Overcoming limited representational capacity and insufficient data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging video generative model for additional supervision
Injecting physical identity through video fine-tuning (see the sketch after this list)
Employing patch-based denoising for higher resolution
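
A minimal sketch of the identity-injection idea: fine-tune the pretrained video generator on short clips of the target subject with its standard denoising objective, so that generated virtual views preserve the subject's body shape and appearance. `model.diffusion_loss` and the hyperparameters are hypothetical placeholders, not Human4DiT's real training API.

```python
# Identity fine-tuning sketch: adapt the video generator to one subject.
import torch

def finetune_identity(model, identity_clips, steps=500, lr=1e-5):
    """identity_clips: list of video tensors showing the target subject."""
    model.train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(steps):
        clip = identity_clips[step % len(identity_clips)]
        # Standard denoising objective on the subject's own footage
        # (placeholder for the model's actual training loss).
        loss = model.diffusion_loss(clip)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```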