PriorFormer: A Transformer for Real-time Monocular 3D Human Pose Estimation with Versatile Geometric Priors

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of missing geometric priors (e.g., camera intrinsics, bone lengths), diverse deployment scenarios (calibrated vs. uncalibrated), and strict model-size constraints in real-time monocular video 3D human pose estimation, this paper proposes a lightweight Transformer architecture. A unified encoder coupled with a learnable masking mechanism adaptively integrates or suppresses incomplete geometric priors, ensuring robust inference across scenarios. Pretraining on synthetic 2D data and geometry-aware positional encoding further enhance generalization. Evaluated on AMASS, the method achieves 36 mm MPJPE, outperforming the state of the art by 5 mm, while maintaining real-time efficiency: only 380 μs on GPU and 1.8 ms on CPU. To the best of our knowledge, it is the first 3D pose-lifting network that simultaneously achieves high accuracy, strong generalization across calibration conditions, and real-time performance.
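The core idea of the masking mechanism, as summarized above, is that prior information (camera intrinsics, segment lengths) enters the Transformer as extra tokens, and a learned embedding stands in for any prior that is unavailable. The sketch below is a minimal, hypothetical PyTorch rendering of that idea; the class name, layer sizes, token layout, and projection heads are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PriorMaskedLifter(nn.Module):
    """Sketch of a 2D->3D lifter whose geometric-prior tokens (camera
    intrinsics, segment lengths) are replaced by learned mask embeddings
    when a prior is missing. Dimensions are illustrative."""

    def __init__(self, n_joints=17, d_model=64):
        super().__init__()
        self.joint_proj = nn.Linear(2, d_model)             # per-joint 2D position -> token
        self.cam_proj = nn.Linear(4, d_model)               # (fx, fy, cx, cy) -> token
        self.bone_proj = nn.Linear(n_joints - 1, d_model)   # segment lengths -> token
        self.mask_token = nn.Parameter(torch.zeros(2, d_model))  # one learned stand-in per prior type
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, 3)                   # per-joint 3D output

    def forward(self, joints_2d, intrinsics=None, bone_lengths=None):
        # joints_2d: (B, T, J, 2) -- flatten the short sequence into joint tokens
        B, T, J, _ = joints_2d.shape
        tokens = self.joint_proj(joints_2d.reshape(B, T * J, 2))
        # Each prior contributes one token; missing priors use the learned mask token,
        # so the same network runs in calibrated and uncalibrated settings.
        cam_tok = (self.cam_proj(intrinsics) if intrinsics is not None
                   else self.mask_token[0].expand(B, -1)).unsqueeze(1)
        bone_tok = (self.bone_proj(bone_lengths) if bone_lengths is not None
                    else self.mask_token[1].expand(B, -1)).unsqueeze(1)
        x = self.encoder(torch.cat([cam_tok, bone_tok, tokens], dim=1))
        # Decode the last frame's joint tokens to 3D joint positions: (B, J, 3)
        return self.head(x[:, 2 + (T - 1) * J:, :])
```

During training, priors would be dropped at random so the network learns to exploit them when present and fall back gracefully when they are not.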

📝 Abstract
This paper proposes a new lightweight Transformer-based lifter that maps short sequences of 2D human joint positions to 3D poses from a single camera. The model takes geometric priors as input, including segment lengths and camera intrinsics, and is designed to operate in both calibrated and uncalibrated settings. To this end, a masking mechanism enables the model to ignore missing priors during training and inference. This yields a single versatile network that can adapt to different deployment scenarios, from fully calibrated lab environments to in-the-wild monocular videos without calibration. The model was trained on 3D keypoints from the AMASS dataset, with corresponding synthetic 2D data generated by sampling random camera poses and intrinsics. It was then compared to an expert model trained only on complete priors, and validated through an ablation study. Results show that both camera and segment-length priors improve performance, and that the versatile model outperforms the expert even when all priors are available while maintaining high accuracy when priors are missing. Overall, the average error in estimated 3D joint-center positions was as low as 36 mm, improving on the state of the art by half a centimeter at a much lower computational cost: the proposed model runs in 380 μs on GPU and 1800 μs on CPU, making it suitable for deployment on embedded platforms and low-power devices.
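The training-data recipe described in the abstract, projecting 3D keypoints through randomly sampled cameras to obtain paired 2D inputs, can be sketched as follows. This is a minimal pinhole-projection example; the intrinsic and pose sampling ranges are hypothetical, not the paper's actual sampling distribution.

```python
import numpy as np

def random_camera(rng):
    """Sample an illustrative random pinhole camera: intrinsics K,
    rotation R, and translation t (sampling ranges are assumptions)."""
    fx = fy = rng.uniform(500, 1500)
    cx, cy = rng.uniform(300, 700, size=2)
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])
    # Random rotation via QR decomposition of a Gaussian matrix
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1  # keep a proper rotation (det = +1)
    t = np.array([0.0, 0.0, rng.uniform(3.0, 6.0)])  # keep the subject in front of the camera
    return K, Q, t

def project(joints_3d, K, R, t):
    """Pinhole projection of (J, 3) world-frame joints to (J, 2) pixel coordinates."""
    cam = joints_3d @ R.T + t          # world -> camera frame
    uv = cam @ K.T                      # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]       # perspective divide

```

Repeating this for many random cameras turns each AMASS 3D sequence into many synthetic 2D/3D training pairs, with the sampled intrinsics available as a prior that can be kept or masked.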
Problem

Research questions and friction points this paper is trying to address.

Estimates 3D human pose from monocular 2D joints in real-time
Handles both calibrated and uncalibrated camera settings flexibly
Maintains accuracy when geometric priors are partially missing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based lifter for 3D pose estimation
Masking mechanism handles missing geometric priors
Lightweight design enables real-time embedded deployment