LieHMR: Autoregressive Human Mesh Recovery with $SO(3)$ Diffusion

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Probabilistic human mesh recovery (HMR) from single RGB images suffers from inherent pose-shape ambiguity, making it challenging for existing methods to simultaneously achieve high prediction accuracy and diverse, realistic sample generation.

Method: We propose a conditional diffusion model grounded in the SO(3) Lie group, explicitly modeling pose parameters as image-conditioned rotational distributions. Our approach introduces an SO(3)-adapted diffusion process with conditional annealing, integrates a Transformer-based joint feature encoder and an MLP denoiser, and employs conditional dropout to enhance robustness in distribution modeling.

Results: Experiments demonstrate that our method significantly outperforms state-of-the-art probabilistic HMR models in pose uncertainty quantification. A single diffusion sample matches the accuracy of top deterministic methods, while consistently improving accuracy, diversity, and realism on benchmarks including AMASS and 3DPW—effectively breaking the long-standing accuracy–diversity trade-off in probabilistic HMR.
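The conditional-dropout and conditional-annealing ideas in the summary follow the general classifier-free conditioning pattern: during training the image condition is sometimes replaced by a null token so one network learns both conditional and unconditional distributions, and at sampling time the two predictions are blended. A minimal sketch of that pattern, assuming illustrative names (`NULL_TOKEN`, `drop_condition`, `annealed_prediction`) and dimensions not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# NULL_TOKEN stands in for a learned "unconditional" embedding; its name
# and 4-dim size are illustrative assumptions, not from the paper.
NULL_TOKEN = np.zeros(4)

def drop_condition(cond, p_drop, rng):
    """With probability p_drop, replace the image condition with the null
    token, so a single network learns both the conditional and the
    unconditional distribution (conditioning dropout)."""
    return NULL_TOKEN if rng.random() < p_drop else cond

def annealed_prediction(eps_uncond, eps_cond, w):
    """Blend unconditional and conditional denoiser outputs at sampling
    time: w=0 is fully unconditional, w=1 fully conditional, w>1 pushes
    harder toward the image evidence (one reading of the paper's
    "conditional annealing")."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

The same dropout call is applied per training batch; the blend weight `w` can be varied over diffusion timesteps to trade sample diversity against image fidelity.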

📝 Abstract
We tackle the problem of Human Mesh Recovery (HMR) from a single RGB image, formulating it as image-conditioned human pose and shape generation. While recovering 3D human pose from 2D observations is inherently ambiguous, most existing approaches regress a single deterministic output. Probabilistic methods attempt to address this by generating multiple plausible outputs to model the ambiguity. However, these methods often exhibit a trade-off between accuracy and sample diversity, and their single predictions are not competitive with state-of-the-art deterministic models. To overcome these limitations, we propose a novel approach that models a distribution well aligned with the 2D observations. In particular, we introduce an $SO(3)$ diffusion model that generates the distribution of pose parameters, represented as 3D rotations, both unconditionally and conditioned on image observations via conditioning dropout. Our model learns the hierarchical structure of human body joints using a transformer. Rather than using the transformer itself as the denoising model, a time-independent transformer extracts per-joint latent vectors, and a small MLP-based denoising model learns each joint's distribution conditioned on its latent vector. We experimentally demonstrate and analyze that our model effectively predicts accurate pose probability distributions.
Problem

Research questions and friction points this paper is trying to address.

Modeling 3D human pose ambiguity from single images
Overcoming accuracy-diversity trade-off in probabilistic HMR
Generating well-aligned pose distributions using SO(3) diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

SO(3) diffusion model for pose distribution
Transformer extracts hierarchical joint latent vectors
MLP-based denoiser learns per-joint distributions
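The split described above (a time-independent transformer produces one latent per joint; a small MLP denoiser operates per joint, conditioned on that latent and the timestep) can be sketched as follows. This is a toy shape-level illustration under stated assumptions: the joint count `J=24` (SMPL-style), the latent size `D`, and the axis-angle (tangent-space) rotation representation are my choices, and random linear maps stand in for the learned transformer and MLP:

```python
import numpy as np

# Hypothetical sizes: J joints, D-dim latents. Random matrices stand in
# for the trained networks; only the data flow matches the description.
J, D = 24, 64
rng = np.random.default_rng(1)
W_enc = rng.standard_normal((J, D, D)) * 0.01       # per-joint "encoder"
W_dn = rng.standard_normal((D + 1 + 3, 3)) * 0.01   # per-joint "MLP denoiser"

def encode_joints(img_feat):
    """Time-independent encoder: one latent vector per joint from image
    features (a transformer over the joint hierarchy in the paper)."""
    return np.stack([W @ img_feat for W in W_enc])   # (J, D)

def denoise_step(rot_aa, latents, t):
    """Per-joint denoiser: predicts a tangent-space (axis-angle) update
    for each noisy joint rotation, conditioned on that joint's latent
    and the diffusion timestep t."""
    out = np.empty_like(rot_aa)
    for j in range(J):
        x = np.concatenate([latents[j], [t], rot_aa[j]])
        out[j] = x @ W_dn                            # (3,) update per joint
    return out

img_feat = rng.standard_normal(D)       # image feature (encoder output)
latents = encode_joints(img_feat)       # computed once, reused every step
noisy = rng.standard_normal((J, 3))     # noisy per-joint rotations
update = denoise_step(noisy, latents, t=0.5)
```

Because the transformer is time-independent, the joint latents are computed once per image and reused across all denoising steps; only the small MLP runs inside the diffusion loop, which keeps multi-sample generation cheap.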
Donghwan Kim
KAIST
Optimization for Machine Learning
Tae-Kyun Kim
School of Computing, KAIST