PEAR: Pixel-aligned Expressive humAn mesh Recovery

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of high-fidelity 3D human reconstruction from a single in-the-wild image, where existing methods suffer from slow inference, coarse pose estimation, and inaccurate hand and facial details. The authors propose PEAR, a lightweight, unified Vision Transformer (ViT) framework that jointly estimates SMPL-X and scaled-FLAME parameters in an end-to-end manner without requiring any preprocessing, enabling real-time inference at over 100 FPS. Pixel-level geometric supervision is introduced to refine fine-grained details, while a modular annotation strategy enhances data robustness. Extensive experiments demonstrate that PEAR significantly outperforms state-of-the-art approaches across multiple benchmarks, achieving substantial improvements in both body pose and facial expression reconstruction accuracy.

📝 Abstract
Reconstructing detailed 3D human meshes from a single in-the-wild image remains a fundamental challenge in computer vision. Existing SMPLX-based methods often suffer from slow inference, produce only coarse body poses, and exhibit misalignments or unnatural artifacts in fine-grained regions such as the face and hands. These issues make current approaches difficult to apply to downstream tasks. To address these challenges, we propose PEAR, a fast and robust framework for pixel-aligned expressive human mesh recovery. PEAR explicitly tackles three major limitations of existing methods: slow inference, inaccurate localization of fine-grained human pose details, and insufficient facial expression capture. Specifically, to enable real-time SMPLX parameter inference, we depart from prior designs that rely on high-resolution inputs or multi-branch architectures. Instead, we adopt a clean and unified ViT-based model capable of recovering coarse 3D human geometry. To compensate for the loss of fine-grained details caused by this simplified architecture, we introduce pixel-level supervision to optimize the geometry, significantly improving the reconstruction accuracy of fine-grained human details. To make this approach practical, we further propose a modular data annotation strategy that enriches the training data and enhances the robustness of the model. Overall, PEAR is a preprocessing-free framework that can simultaneously infer EHM-s (SMPLX and scaled-FLAME) parameters at over 100 FPS. Extensive experiments on multiple benchmark datasets demonstrate that our method achieves substantial improvements in pose estimation accuracy compared to previous SMPLX-based approaches. Project page: https://wujh2001.github.io/PEAR
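The abstract's "pixel-level supervision" presumably penalizes the 2D image-plane error of the predicted mesh rather than only its 3D parameters. The paper does not give the exact loss, so the sketch below is a minimal, hypothetical form: predicted 3D vertices are perspective-projected and compared against ground-truth pixel locations with an L1 penalty. All function names and camera intrinsics here are illustrative assumptions, not taken from PEAR.

```python
def project(vertices_3d, focal=1000.0, cx=128.0, cy=128.0):
    # Perspective-project 3D points (x, y, z) in camera space to pixel
    # coordinates, using assumed pinhole intrinsics (focal, cx, cy).
    return [(focal * x / z + cx, focal * y / z + cy) for x, y, z in vertices_3d]

def pixel_aligned_l1(pred_vertices_3d, gt_pixels):
    # Hypothetical pixel-aligned loss: mean L1 distance, in pixels,
    # between projected predicted vertices and ground-truth 2D locations.
    proj = project(pred_vertices_3d)
    total = sum(abs(px - gx) + abs(py - gy)
                for (px, py), (gx, gy) in zip(proj, gt_pixels))
    return total / len(proj)

# Toy usage: two vertices, one of which is annotated one pixel off.
pred = [(0.0, 0.0, 2.0), (0.1, 0.0, 2.0)]       # projects to (128,128), (178,128)
gt = [(129.0, 128.0), (178.0, 128.0)]
loss = pixel_aligned_l1(pred, gt)                # mean L1 error = 0.5 px
```

Because the error is measured in pixels, gradients concentrate on regions where small parameter changes cause visible misalignment (hands, face contours), which matches the paper's stated goal of refining fine-grained details.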
Problem

Research questions and friction points this paper is trying to address.

3D human mesh reconstruction
single image
SMPLX
fine-grained details
in-the-wild images
Innovation

Methods, ideas, or system contributions that make the work stand out.

pixel-aligned supervision
real-time SMPLX inference
ViT-based human mesh recovery
modular data annotation
expressive human modeling