VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery

📅 2026-02-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Monocular human mesh recovery is inherently ambiguous, and existing diffusion models often generate multi-hypothesis outputs that are physically implausible or inconsistent with the input image. To address this, the work proposes a dual-memory-augmented HMR critique agent that uses a self-reflection mechanism to produce context-aware quality scores for predicted meshes, and uses those scores to build a group-wise preference dataset. On top of this dataset, it introduces a vision-language model (VLM)-guided group preference alignment strategy for fine-tuning diffusion-based HMR models, injecting the fine-grained quality signals into the reconstruction model. The authors report superior performance over state-of-the-art approaches, with more physically plausible and image-consistent meshes, particularly under occlusion and in cluttered, in-the-wild scenes.

📝 Abstract

Human mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent diffusion-based methods tackle this by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this, we introduce a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes. These scores distill fine-grained cues about 3D human motion structure, physical feasibility, and alignment with the input image. We use these scores to build a group-wise HMR preference dataset. Leveraging this dataset, we propose a group preference alignment framework for finetuning diffusion-based HMR models. This process injects the rich preference signals into the model, guiding it to generate more physically plausible and image-consistent human meshes. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art approaches.
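
The abstract does not spell out how critic scores become a group-wise preference record, so the following is only a minimal sketch under stated assumptions. It assumes each image has several sampled mesh hypotheses, each with one scalar critic score, and it turns those scores into a within-group ranking plus zero-mean, unit-variance "advantages" in the style of group-relative preference methods. The names `GroupPreference` and `build_group`, the score scale, and the normalization itself are hypothetical illustrations, not the paper's formulation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class GroupPreference:
    """All mesh hypotheses sampled for one image, with their critic scores.

    Hypothetical container for illustration; not taken from the paper.
    """
    image_id: str
    scores: List[float]      # one critic score per hypothesis (assumed scalar)
    ranking: List[int]       # hypothesis indices, best score first
    advantages: List[float]  # scores normalized within the group


def build_group(image_id: str, scores: List[float], eps: float = 1e-8) -> GroupPreference:
    """Rank hypotheses and normalize their scores within the group.

    Assumption: preference strength is expressed as (score - group mean)
    divided by the group standard deviation.
    """
    if not scores:
        raise ValueError("a group needs at least one scored hypothesis")
    mu = mean(scores)
    sigma = pstdev(scores)  # population std; 0.0 when all scores are equal
    advantages = [(s - mu) / (sigma + eps) for s in scores]
    # Stable sort: tied hypotheses keep their original sampling order.
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return GroupPreference(image_id, list(scores), ranking, advantages)


if __name__ == "__main__":
    group = build_group("img_0001", [0.2, 0.8, 0.5])
    print(group.ranking)
    print([round(a, 3) for a in group.advantages])
```

In a fine-tuning loop, such advantages could weight each hypothesis's training signal so that above-average meshes are reinforced and below-average ones are suppressed. Whether the paper uses this weighting or a different objective is not stated in the abstract.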

Problem

Research questions and friction points this paper is trying to address.

Human Mesh Recovery

Diffusion Models

Ambiguity

Physical Plausibility

Image Consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based Human Mesh Recovery

Preference Alignment

Dual-Memory Critique Agent