🤖 AI Summary
Existing methods suffer from limited accuracy in crowded scenes, multi-person interactions, and monocular reconstruction, while struggling to jointly capture global image context and fine-grained human geometry. This paper introduces the first promptable human mesh recovery framework, unifying spatial prompts (bounding boxes, masks) and semantic prompts (natural-language descriptions, interaction labels) for end-to-end estimation from full images. Built upon a Transformer architecture, our method integrates a vision encoder, a prompt embedding module, and a parametric decoder—enabling, for the first time, language-driven fine-grained body shape optimization and joint modeling of multi-person interactions. We achieve state-of-the-art performance across multiple benchmarks: recovering complete 3D human meshes even from face-scale bounding boxes, maintaining temporal consistency in video sequences, and improving body shape estimation accuracy by 12.3% under language guidance.
📝 Abstract
Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary "side information" that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot exploit scene context, while methods that process the whole image often fail to detect people and are less accurate than methods that use crops. While recent language-based methods explore HPS reasoning through large language or vision-language models, their metric accuracy is well below the state of the art. In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities: spatial prompts like bounding boxes and masks, and semantic prompts like language descriptions or interaction labels. PromptHMR demonstrates robust performance across challenging scenarios: estimating people from bounding boxes as small as faces in crowded scenes, improving body shape estimation through language descriptions, modeling person-person interactions, and producing temporally coherent motions in videos. Experiments on benchmarks show that PromptHMR achieves state-of-the-art performance while offering flexible prompt-based control over the HPS estimation process.
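To make the promptable-interface idea concrete, here is a minimal, purely illustrative sketch of how spatial and semantic prompts might be mapped into a shared token space before being consumed by a transformer decoder. All names (`SpatialPrompt`, `SemanticPrompt`, `embed_prompts`) and the toy hash-based text embedding are assumptions for illustration, not PromptHMR's actual API; a real system would use a learned text encoder and a much larger embedding dimension.

```python
# Hypothetical sketch: unify spatial and semantic prompts as fixed-size
# embedding tokens. This is NOT the paper's implementation.
from dataclasses import dataclass
import hashlib

EMBED_DIM = 8  # toy size; real prompt tokens use hundreds of dimensions

@dataclass
class SpatialPrompt:
    box: tuple  # (x1, y1, x2, y2) bounding box in pixels

@dataclass
class SemanticPrompt:
    text: str   # e.g. "a tall, broad-shouldered person"

def embed_prompts(prompts):
    """Map each prompt to one EMBED_DIM-sized vector.

    Spatial prompts: normalized box coordinates padded with zeros.
    Semantic prompts: a stand-in hash-based embedding (a real model
    would use a pretrained language/text encoder instead).
    """
    tokens = []
    for p in prompts:
        if isinstance(p, SpatialPrompt):
            vec = [c / 1000.0 for c in p.box] + [0.0] * (EMBED_DIM - 4)
        else:
            digest = hashlib.sha256(p.text.encode()).digest()
            vec = [b / 255.0 for b in digest[:EMBED_DIM]]
        tokens.append(vec)
    return tokens

# Both prompt modalities end up as same-shaped tokens, so a decoder
# can attend over them alongside image features.
toks = embed_prompts([
    SpatialPrompt(box=(120, 40, 380, 720)),
    SemanticPrompt(text="a tall, broad-shouldered person"),
])
print(len(toks), len(toks[0]))  # 2 EMBED_DIM
```

The design point this sketch illustrates is that once every prompt modality is projected into a common token space, adding a new prompt type (e.g. a mask or an interaction label) only requires a new embedding branch, not a new model.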