SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses 3D reconstruction of close-proximity human interactions from monocular video, where severe mutual occlusion causes pose ambiguity, temporal inconsistency, and erroneous spatial relationships. The authors propose SocialMirror, a diffusion-based framework that combines semantic guidance with geometric constraints. The approach first uses a vision-language model to generate high-level interaction descriptions, which guide a motion infiller that fills in occluded bodies. A sequence-level temporal refiner then enforces smooth motion and applies contact-aware geometric constraints during sampling to keep contacts and spatial relationships plausible. On multiple interaction benchmarks, the method is reported to achieve state-of-the-art performance and to generalize to unseen datasets and in-the-wild scenarios.

📝 Abstract

Accurately reconstructing human behavior in close-interaction scenarios is crucial for enabling realistic virtual interactions in augmented reality, precise motion analysis in sports, and natural collaborative behavior in human-robot tasks. Reliable reconstruction in these contexts significantly enhances the realism and effectiveness of AI-driven interactive applications. However, human reconstruction from monocular videos in close-interaction scenarios remains challenging due to severe mutual occlusions, which lead to local motion ambiguity, disrupted temporal continuity, and spatial relationship errors. In this paper, we propose SocialMirror, a diffusion-based framework that integrates semantic and geometric cues to address these issues. Specifically, we first use high-level interaction descriptions generated by a vision-language model to drive a semantic-guided motion infiller, which hallucinates occluded bodies and resolves local pose ambiguities. Next, we propose a sequence-level temporal refiner that enforces smooth, jitter-free motions while incorporating geometric constraints during sampling to ensure plausible contact and spatial relationships. Evaluations on multiple interaction benchmarks show that SocialMirror achieves state-of-the-art performance in reconstructing interactive human meshes and generalizes well to unseen datasets and in-the-wild scenarios. The code will be released upon publication.
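
The abstract does not spell out how geometric constraints enter the sampling process. As a rough illustration only, the sketch below shows one common way such guidance can work: a gradient step on a contact energy applied to the current motion estimate for two people. The function names (`contact_energy`, `contact_guidance_step`), the squared-distance energy, the joint-pair representation, and the step size are all assumptions made for this example; they are not taken from the paper.

```python
import numpy as np


def contact_energy(joints_a, joints_b, pairs):
    """Sum of squared distances between joint pairs assumed to be in contact.

    joints_a, joints_b: arrays of shape (T, J, 3), one per person.
    pairs: list of (index_in_a, index_in_b) tuples (hypothetical contact labels).
    """
    idx_a = [p[0] for p in pairs]
    idx_b = [p[1] for p in pairs]
    diff = joints_a[:, idx_a] - joints_b[:, idx_b]  # (T, P, 3)
    return float((diff ** 2).sum())


def contact_guidance_step(joints_a, joints_b, pairs, step=0.1):
    """Take one gradient-descent step on the contact energy.

    In a guided diffusion sampler, a step like this would typically be applied
    to the denoised estimate at each sampling iteration (an assumption here).
    """
    grad_a = np.zeros_like(joints_a)
    grad_b = np.zeros_like(joints_b)
    for ja, jb in pairs:
        d = joints_a[:, ja] - joints_b[:, jb]  # (T, 3)
        grad_a[:, ja] += 2.0 * d   # dE/d(joint in a)
        grad_b[:, jb] -= 2.0 * d   # dE/d(joint in b)
    return joints_a - step * grad_a, joints_b - step * grad_b


if __name__ == "__main__":
    # Toy data: 2 frames, 3 joints per person, all of A at 0 and all of B at 1.
    a = np.zeros((2, 3, 3))
    b = np.ones((2, 3, 3))
    pairs = [(0, 1)]  # joint 0 of A should touch joint 1 of B
    before = contact_energy(a, b, pairs)
    a2, b2 = contact_guidance_step(a, b, pairs)
    after = contact_energy(a2, b2, pairs)
    print(before, after)
```

Only the joints named in `pairs` are moved, and the two bodies are pulled toward each other symmetrically. A real system would balance a term like this against the diffusion prior and a temporal smoothness term, and would likely work with mesh vertices rather than a handful of joints.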

Problem

Research questions and friction points this paper is trying to address.

3D human reconstruction

monocular video

human interaction

mutual occlusion

motion ambiguity

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion-based reconstruction

semantic-guided motion infilling

geometric constraints

monocular human interaction

3D human mesh
