🤖 AI Summary
Existing monocular 4D reconstruction methods for human-scene interaction either neglect motion-specific human dynamics—leading to incomplete reconstructions—or lack inter-component information coordination, causing spatial inconsistencies at boundaries and visual artifacts. To address these issues, we propose a “separate-then-map” framework: human and scene dynamics are first modeled independently; subsequently, a shared transformation function unifies their Gaussian attributes, enabling efficient, lightweight information mapping and fusion. This design avoids costly global interaction modeling, preserving inference efficiency while significantly mitigating boundary distortion. Built upon 4D neural rendering, our method achieves state-of-the-art performance across multiple monocular video datasets. Notably, it delivers substantial improvements in geometric accuracy and visual fidelity—particularly at interaction boundaries—demonstrating superior reconstruction quality for dynamic human-scene interactions.
📝 Abstract
Reconstructing dynamic humans interacting with real-world environments from monocular videos is an important and challenging task. Despite considerable progress in 4D neural rendering, existing approaches either model dynamic scenes holistically or model humans and backgrounds separately by introducing parametric human priors. However, the former neglects the distinct motion characteristics of different components in the scene, especially humans, leading to incomplete reconstructions, while the latter ignores information exchange between the separately modeled components, resulting in spatial inconsistencies and visual artifacts at human-scene boundaries. To address this, we propose a **Separate-then-Map** (StM) strategy that introduces a dedicated information mapping mechanism to bridge separately defined and optimized models. Our method applies a shared transformation function to each Gaussian attribute to unify the separately modeled components, improving computational efficiency by avoiding exhaustive pairwise interactions while ensuring spatial and visual coherence between humans and their surroundings. Extensive experiments on monocular video datasets demonstrate that StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy, particularly at challenging human-scene interaction boundaries.
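To make the "separate-then-map" idea concrete, the following is a minimal sketch of how separately modeled human and scene Gaussian attribute sets could be unified by a shared per-attribute transformation before fusion. The affine form of the transform, the attribute names, and all function names here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def shared_map(attrs, transforms):
    # Apply the SAME learned transform to each Gaussian attribute,
    # regardless of whether it came from the human or the scene model.
    return {k: attrs[k] @ A + b for k, (A, b) in transforms.items()}

def separate_then_map(human, scene, transforms):
    # 1) "Separate": human and scene Gaussians were optimized independently.
    # 2) "Map": unify both sets via the shared per-attribute transforms.
    # 3) Fuse: concatenate the unified Gaussian sets for joint rendering.
    h, s = shared_map(human, transforms), shared_map(scene, transforms)
    return {k: np.concatenate([h[k], s[k]], axis=0) for k in h}

rng = np.random.default_rng(0)
human = {"position": rng.normal(size=(100, 3)), "opacity": rng.uniform(size=(100, 1))}
scene = {"position": rng.normal(size=(500, 3)), "opacity": rng.uniform(size=(500, 1))}

# One shared affine transform (A, b) per attribute; identity-initialized
# here, but learned jointly during optimization in the real method.
transforms = {
    "position": (np.eye(3), np.zeros(3)),
    "opacity": (np.eye(1), np.zeros(1)),
}

fused = separate_then_map(human, scene, transforms)
print(fused["position"].shape)  # (600, 3)
```

Because each attribute passes through one shared function rather than pairwise human-scene interaction terms, the mapping cost grows linearly with the number of Gaussians, which is the efficiency argument made in the abstract.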