🤖 AI Summary
Most existing 3D human pose estimation methods overlook natural language descriptions, a rich and readily available semantic prior, which makes physical contact (e.g., human-human interaction and self-contact) difficult to model in markerless, label-free settings. This work introduces the first semantic-driven framework to leverage large multimodal models (LMMs) for this task: it automatically parses natural-language contact descriptions into tractable contact constraints and integrates a contact-aware loss into the 3D pose optimization pipeline. Its core innovation is using LMMs as zero-shot contact priors, eliminating the reliance on manual annotations or motion-capture data. Experiments demonstrate physically plausible pose reconstruction in both two-person interaction and self-contact scenarios, advancing performance in settings without contact supervision. The code is publicly available.
📝 Abstract
Language is often used to describe physical interaction, yet most 3D human pose estimation methods overlook this rich source of information. We bridge this gap by leveraging large multimodal models (LMMs) as priors for reconstructing contact poses, offering a scalable alternative to traditional methods that rely on human annotations or motion capture data. Our approach extracts contact-relevant descriptors from an LMM and translates them into tractable losses to constrain 3D human pose optimization. Despite its simplicity, our method produces compelling reconstructions for both two-person interactions and self-contact scenarios, accurately capturing the semantics of physical and social interactions. Our results demonstrate that LMMs can serve as powerful tools for contact prediction and pose estimation, offering an alternative to costly manual human annotations or motion capture data. Our code is publicly available at https://prosepose.github.io.
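The pipeline both paragraphs describe, extracting contact descriptors from an LMM and translating them into a loss that constrains pose optimization, can be sketched roughly as follows. This is a minimal illustration only: the part-to-joint mapping, joint indices, and the `contact_loss` function are assumed names for exposition, not the paper's actual interface, and a real pipeline would compute this loss on autodiff tensors inside the optimizer.

```python
import numpy as np

# Hypothetical mapping from coarse body-part names (as an LMM might emit
# when asked "which body parts are touching?") to joint indices in some
# skeleton convention. The indices here are placeholders.
PART_TO_JOINT = {"left_hand": 0, "right_hand": 1, "left_shoulder": 2, "back": 3}

def contact_loss(joints_a, joints_b, contact_pairs):
    """Penalize distance between joint pairs the LMM flags as in contact.

    joints_a, joints_b: (J, 3) arrays of 3D joint positions for two people
                        (pass the same array twice for self-contact).
    contact_pairs: list of (part_on_a, part_on_b) strings parsed from the
                   LMM's natural-language contact description.
    """
    loss = 0.0
    for part_a, part_b in contact_pairs:
        pa = joints_a[PART_TO_JOINT[part_a]]
        pb = joints_b[PART_TO_JOINT[part_b]]
        loss += float(np.sum((pa - pb) ** 2))  # squared Euclidean distance
    return loss

# E.g. the LMM describes "person A's right hand touches person B's left shoulder":
joints_a = np.zeros((4, 3))
joints_b = np.ones((4, 3))
pairs = [("right_hand", "left_shoulder")]
print(contact_loss(joints_a, joints_b, pairs))  # → 3.0
```

Minimizing such a term during pose optimization pulls the named body parts together, which is how a free-text contact description becomes a geometric constraint.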