DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of multi-view human mesh reconstruction, which is hindered by annotation noise in real-world data and domain shift from synthetic data. To overcome these limitations, the authors propose a diffusion-based zero-shot transfer framework trained exclusively on synthetic data. The method employs a multi-conditioning mechanism to generate view-consistent, pixel-aligned dense proxy representations of the human body. Additionally, it incorporates a vision-prompt-guided hand detail enhancement module and an uncertainty-aware test-time optimization strategy. Evaluated on five real-world benchmarks, the approach achieves state-of-the-art performance, demonstrating significant improvements in reconstruction accuracy—particularly under challenging conditions such as occlusions and partial viewpoints.

📝 Abstract
Human mesh recovery from multi-view images faces a fundamental challenge: real-world datasets contain imperfect ground-truth annotations that bias the models' training, while synthetic data with precise supervision suffers from a domain gap. In this paper, we propose DiffProxy, a novel framework that generates multi-view consistent human proxies for mesh recovery. Central to DiffProxy is the use of diffusion-based generative priors to bridge synthetic training and real-world generalization. Its key innovations include: (1) a multi-conditioning mechanism for generating multi-view consistent, pixel-aligned human proxies; (2) a hand refinement module that incorporates flexible visual prompts to enhance local details; and (3) an uncertainty-aware test-time scaling method that increases robustness to challenging cases during optimization. These designs ensure that the mesh recovery process benefits both from precise synthetic ground truth and from the generative advantages of the diffusion-based pipeline. Trained entirely on synthetic data, DiffProxy achieves state-of-the-art performance across five real-world benchmarks, demonstrating strong zero-shot generalization, particularly in challenging scenarios with occlusions and partial views. Project page: https://wrk226.github.io/DiffProxy.html
Problem

Research questions and friction points this paper is trying to address.

human mesh recovery
multi-view images
domain gap
imperfect annotations
synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based generative priors
Multi-view consistent proxies
Hand refinement with visual prompts
Uncertainty-aware test-time scaling
Zero-shot generalization
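To make the "uncertainty-aware" idea above concrete, here is a toy sketch (not the paper's actual method) of how per-view uncertainty can weight a multi-view optimization: a linear triangulation (DLT) in which observations from less certain views contribute less to the recovered 3D point. The function name, the camera setup, and the inverse-sigma weighting scheme are all illustrative assumptions.

```python
import numpy as np

def triangulate_weighted(proj_mats, points_2d, sigmas):
    """Uncertainty-weighted linear triangulation (DLT sketch).

    proj_mats : list of 3x4 camera projection matrices, one per view
    points_2d : list of (u, v) pixel observations, one per view
    sigmas    : per-view uncertainty (std. dev.); smaller sigma => larger weight
    """
    rows = []
    for P, (u, v), s in zip(proj_mats, points_2d, sigmas):
        w = 1.0 / max(s, 1e-6)            # confidence weight for this view
        # standard DLT constraints, scaled by the view's confidence
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    # homogeneous least-squares solution: right singular vector
    # associated with the smallest singular value of A
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]                   # dehomogenize to a 3D point
```

With noiseless observations the weighted system is still satisfied exactly, so the true point is recovered regardless of the weights; the weights matter once the per-view proxy predictions are noisy, which is the regime the uncertainty-aware optimization targets.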
👥 Authors
Renke Wang
PCA Lab, Nanjing University of Science and Technology, China
Zhenyu Zhang
Associate Professor, Nanjing University, Suzhou campus
Digital Human · 3D Vision
Ying Tai
Nanjing University, School of Intelligent Science and Technology
Jian Yang
Prof. of Computer Science, Nanjing University of Science and Technology
Pattern Recognition · Computer Vision · Biometrics