SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance

📅 2026-04-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses 3D reconstruction of close-proximity human interactions from monocular video, where severe mutual occlusions cause pose ambiguity, temporal inconsistency, and erroneous spatial relationships. The authors propose SocialMirror, a diffusion-based framework that combines semantic guidance with geometric constraints. The approach first uses a vision-language model to generate high-level interaction descriptions that guide a motion infiller in recovering occluded bodies. A sequence-level temporal refiner then enforces smooth motion and applies geometric constraints during sampling to keep contact and spatial relationships plausible. On multiple interaction benchmarks, the method achieves state-of-the-art performance and generalizes to unseen datasets and in-the-wild scenarios.

📝 Abstract
Accurately reconstructing human behavior in close-interaction scenarios is crucial for enabling realistic virtual interactions in augmented reality, precise motion analysis in sports, and natural collaborative behavior in human-robot tasks. Reliable reconstruction in these contexts significantly enhances the realism and effectiveness of AI-driven interactive applications. However, human reconstruction from monocular videos in close-interaction scenarios remains challenging due to severe mutual occlusions, which lead to local motion ambiguity, disrupted temporal continuity, and spatial relationship errors. In this paper, we propose SocialMirror, a diffusion-based framework that integrates semantic and geometric cues to address these issues. Specifically, we first use high-level interaction descriptions generated by a vision-language model to guide a semantic-guided motion infiller, which hallucinates occluded bodies and resolves local pose ambiguities. Next, we propose a sequence-level temporal refiner that enforces smooth, jitter-free motions while incorporating geometric constraints during sampling to ensure plausible contact and spatial relationships. Evaluations on multiple interaction benchmarks show that SocialMirror achieves state-of-the-art performance in reconstructing interactive human meshes and generalizes well to unseen datasets and in-the-wild scenarios. The code will be released upon publication.
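
The abstract does not specify how the temporal smoothness and contact constraints are formulated, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes joint trajectories stored as NumPy arrays of shape (T, J, 3); the function names, the `contact_pairs` input, and the `margin` value are all hypothetical.

```python
import numpy as np


def acceleration_penalty(joints):
    """Mean squared second-order finite difference over time.

    joints: array of shape (T, J, 3). A low value indicates smooth,
    jitter-free motion. (Hypothetical helper, not from the paper.)
    """
    accel = joints[2:] - 2.0 * joints[1:-1] + joints[:-2]
    return float(np.mean(np.sum(accel ** 2, axis=-1)))


def contact_penalty(joints_a, joints_b, contact_pairs, margin=0.02):
    """Penalize joint pairs expected to touch but farther apart than margin.

    joints_a, joints_b: arrays of shape (T, J, 3) for the two people.
    contact_pairs: list of (joint index in A, joint index in B); assumed
    to be given, e.g. derived from an interaction description.
    margin: allowed gap in meters (an assumed value).
    """
    idx_a = [pair[0] for pair in contact_pairs]
    idx_b = [pair[1] for pair in contact_pairs]
    dist = np.linalg.norm(joints_a[:, idx_a] - joints_b[:, idx_b], axis=-1)
    excess = np.maximum(dist - margin, 0.0)
    return float(np.mean(excess ** 2))


def guidance_cost(joints_a, joints_b, contact_pairs, w_smooth=1.0, w_contact=1.0):
    """Weighted sum of smoothness and contact terms (weights are assumptions)."""
    smooth = acceleration_penalty(joints_a) + acceleration_penalty(joints_b)
    contact = contact_penalty(joints_a, joints_b, contact_pairs)
    return w_smooth * smooth + w_contact * contact
```

In a guided diffusion sampler, the gradient of a cost like this with respect to the current motion estimate could be used to nudge each denoising step; whether SocialMirror does this in the same way is not stated in the abstract.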
Problem

Research questions and friction points this paper is trying to address.

3D human reconstruction
monocular video
human interaction
mutual occlusion
motion ambiguity
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion-based reconstruction
semantic-guided motion infilling
geometric constraints
monocular human interaction
3D human mesh