🤖 AI Summary
Reconstructing close-range human interactions from in-the-wild videos remains challenging because of visual ambiguities and inter-person occlusions. To address this, we propose a dual-branch optimization framework built on a diffusion prior. Methodologically, we bring social proxemics (priors on interpersonal distance) into human pose estimation; use human appearance as a cue alongside motion; and enforce physical plausibility through constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations, all optimized jointly over two optimizable tensors. Our contributions are threefold: (1) a diffusion-driven interaction reconstruction method that incorporates a proxemics prior; (2) an interaction video dataset with pseudo-ground-truth annotations; and (3) results on several benchmarks showing that the method outperforms existing approaches on in-the-wild videos captured in complex environments.
📝 Abstract
Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance provides a straightforward cue for addressing these obstacles. Based on this observation, we propose a dual-branch optimization framework that reconstructs accurate interactive motions with plausible body contacts, constrained by human appearance, social proxemics, and physical laws. Specifically, we first train a diffusion model to learn human proxemic behavior and pose priors. The trained network and two optimizable tensors are then incorporated into the dual-branch optimization framework to reconstruct human motions and appearances. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. With the proxemics prior and these constraints, our method can estimate accurate interactions from in-the-wild videos captured in complex environments. We further build a dataset with pseudo-ground-truth interaction annotations, which may promote future research on pose estimation and human behavior understanding. Experimental results on several benchmarks demonstrate that our method outperforms existing approaches. The code and data are available at https://www.buzhenhuang.com/works/CloseApp.html.
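To make the idea of combining constraints concrete, the following is a minimal sketch, not the authors' implementation, of how a 2D keypoint term and a penetration term might be summed into one objective for two people. Everything here is an assumption made for illustration: the function names, the pinhole camera with a fixed focal length, the loss weights, and especially the sphere-proxy penalty, which stands in for a true mesh penetration constraint. The diffusion prior and the 3D Gaussian appearance term are omitted.

```python
import numpy as np


def project(joints_3d, focal=1000.0):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) image coordinates.

    Assumes the principal point is at the origin (illustrative simplification).
    """
    joints_3d = np.asarray(joints_3d, dtype=float)
    return focal * joints_3d[:, :2] / joints_3d[:, 2:3]


def keypoint_loss(joints_3d, keypoints_2d, confidence, focal=1000.0):
    """Confidence-weighted squared reprojection error against detected 2D keypoints."""
    residual = project(joints_3d, focal) - np.asarray(keypoints_2d, dtype=float)
    weights = np.asarray(confidence, dtype=float)[:, None]
    return float(np.sum(weights * residual ** 2))


def penetration_loss(joints_a, joints_b, radius=0.05):
    """Hypothetical sphere-proxy penalty standing in for a mesh penetration term.

    Each joint is treated as a sphere of the given radius (in meters); any pair of
    spheres from different people that overlaps is penalized quadratically.
    """
    joints_a = np.asarray(joints_a, dtype=float)
    joints_b = np.asarray(joints_b, dtype=float)
    # Pairwise distances between every joint of person A and every joint of person B.
    dist = np.linalg.norm(joints_a[:, None, :] - joints_b[None, :, :], axis=-1)
    overlap = np.clip(2.0 * radius - dist, 0.0, None)
    return float(np.sum(overlap ** 2))


def total_loss(joints_a, joints_b, kp_a, kp_b, conf_a, conf_b, w_kp=1.0, w_pen=10.0):
    """Weighted sum of the two illustrative terms; the weights are arbitrary."""
    kp_term = keypoint_loss(joints_a, kp_a, conf_a) + keypoint_loss(joints_b, kp_b, conf_b)
    return w_kp * kp_term + w_pen * penetration_loss(joints_a, joints_b)
```

In a setup like this, the keypoint term pulls each person toward the image evidence while the penetration term pushes overlapping body parts apart, so the balance between `w_kp` and `w_pen` determines how much contact is tolerated. A learned proxemics prior would add a further term that favors socially plausible relative poses.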