🤖 AI Summary
Existing monocular 4D reconstruction methods for human-scene interaction either neglect motion-specific human dynamics—leading to incomplete reconstructions—or lack inter-component information coordination, causing spatial inconsistencies at boundaries and visual artifacts. To address these issues, we propose a “separate-then-map” framework: human and scene dynamics are first modeled independently; subsequently, a shared transformation function unifies their Gaussian attributes, enabling efficient, lightweight information mapping and fusion. This design avoids costly global interaction modeling, preserving inference efficiency while significantly mitigating boundary distortion. Built upon 4D neural rendering, our method achieves state-of-the-art performance across multiple monocular video datasets. Notably, it delivers substantial improvements in geometric accuracy and visual fidelity—particularly at interaction boundaries—demonstrating superior reconstruction quality for dynamic human-scene interactions.
📝 Abstract
Reconstructing dynamic humans interacting with real-world environments from monocular videos is an important and challenging task. Despite considerable progress in 4D neural rendering, existing approaches either model dynamic scenes holistically or model humans and backgrounds separately by introducing parametric human priors. However, the former neglects the distinct motion characteristics of different components in the scene, especially humans, leading to incomplete reconstructions, while the latter ignores information exchange between the separately modeled components, resulting in spatial inconsistencies and visual artifacts at human-scene boundaries. To address this, we propose a **Separate-then-Map** (StM) strategy that introduces a dedicated information mapping mechanism to bridge separately defined and optimized models. Our method applies a shared transformation function to each Gaussian attribute to unify the separately modeled components, improving computational efficiency by avoiding exhaustive pairwise interactions while ensuring spatial and visual coherence between humans and their surroundings. Extensive experiments on monocular video datasets demonstrate that StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy, particularly at challenging human-scene interaction boundaries.
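To make the "separate-then-map" idea concrete, the following is a minimal sketch of how separately modeled human and scene Gaussian attribute sets could be unified by a shared per-attribute transformation before fusion. The affine form of the transform, the attribute names, and all function names here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def shared_map(attrs, transforms):
    # Apply the SAME learned transform to each Gaussian attribute,
    # regardless of whether it came from the human or the scene model.
    return {k: attrs[k] @ A + b for k, (A, b) in transforms.items()}

def separate_then_map(human, scene, transforms):
    # 1) "Separate": human and scene Gaussians were optimized independently.
    # 2) "Map": unify both sets via the shared per-attribute transforms.
    # 3) Fuse: concatenate the unified Gaussian sets for joint rendering.
    h, s = shared_map(human, transforms), shared_map(scene, transforms)
    return {k: np.concatenate([h[k], s[k]], axis=0) for k in h}

rng = np.random.default_rng(0)
human = {"position": rng.normal(size=(100, 3)), "opacity": rng.uniform(size=(100, 1))}
scene = {"position": rng.normal(size=(500, 3)), "opacity": rng.uniform(size=(500, 1))}

# One shared affine transform (A, b) per attribute; identity-initialized
# here, but learned jointly during optimization in the real method.
transforms = {
    "position": (np.eye(3), np.zeros(3)),
    "opacity": (np.eye(1), np.zeros(1)),
}

fused = separate_then_map(human, scene, transforms)
print(fused["position"].shape)  # (600, 3)
```

Because each attribute passes through one shared function rather than pairwise human-scene interaction terms, the mapping cost grows linearly with the number of Gaussians, which is the efficiency argument made in the abstract.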