PriorFormer: A Transformer for Real-time Monocular 3D Human Pose Estimation with Versatile Geometric Priors

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of missing geometric priors (e.g., camera intrinsics, bone lengths), diverse deployment scenarios (calibrated vs. uncalibrated), and strict model-size constraints in real-time monocular video 3D human pose estimation, this paper proposes a lightweight Transformer architecture. A unified encoder coupled with a learnable masking mechanism adaptively integrates or suppresses incomplete geometric priors, ensuring robust inference across scenarios. Pretraining on synthetic 2D data and geometry-aware positional encoding further enhance generalization. Evaluated on AMASS, the method achieves 36 mm MPJPE, outperforming the state of the art by 5 mm, while maintaining real-time efficiency: only 380 μs on GPU and 1.8 ms on CPU. To the best of our knowledge, it is the first 3D pose-lifting network that simultaneously achieves high accuracy, strong generalization across calibration conditions, and real-time performance.
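The core idea of the masking mechanism, as summarized above, is that prior information (camera intrinsics, segment lengths) enters the Transformer as extra tokens, and a learned embedding stands in for any prior that is unavailable. The sketch below is a minimal, hypothetical PyTorch rendering of that idea; the class name, layer sizes, token layout, and projection heads are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PriorMaskedLifter(nn.Module):
    """Sketch of a 2D->3D lifter whose geometric-prior tokens (camera
    intrinsics, segment lengths) are replaced by learned mask embeddings
    when a prior is missing. Dimensions are illustrative."""

    def __init__(self, n_joints=17, d_model=64):
        super().__init__()
        self.joint_proj = nn.Linear(2, d_model)             # per-joint 2D position -> token
        self.cam_proj = nn.Linear(4, d_model)               # (fx, fy, cx, cy) -> token
        self.bone_proj = nn.Linear(n_joints - 1, d_model)   # segment lengths -> token
        self.mask_token = nn.Parameter(torch.zeros(2, d_model))  # one learned stand-in per prior type
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, 3)                   # per-joint 3D output

    def forward(self, joints_2d, intrinsics=None, bone_lengths=None):
        # joints_2d: (B, T, J, 2) -- flatten the short sequence into joint tokens
        B, T, J, _ = joints_2d.shape
        tokens = self.joint_proj(joints_2d.reshape(B, T * J, 2))
        # Each prior contributes one token; missing priors use the learned mask token,
        # so the same network runs in calibrated and uncalibrated settings.
        cam_tok = (self.cam_proj(intrinsics) if intrinsics is not None
                   else self.mask_token[0].expand(B, -1)).unsqueeze(1)
        bone_tok = (self.bone_proj(bone_lengths) if bone_lengths is not None
                    else self.mask_token[1].expand(B, -1)).unsqueeze(1)
        x = self.encoder(torch.cat([cam_tok, bone_tok, tokens], dim=1))
        # Decode the last frame's joint tokens to 3D joint positions: (B, J, 3)
        return self.head(x[:, 2 + (T - 1) * J:, :])
```

During training, priors would be dropped at random so the network learns to exploit them when present and fall back gracefully when they are not.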

📝 Abstract
This paper proposes a new lightweight Transformer-based lifter that maps short sequences of 2D human joint positions to 3D poses from a single camera. The model takes geometric priors as input, including segment lengths and camera intrinsics, and is designed to operate in both calibrated and uncalibrated settings. To this end, a masking mechanism enables the model to ignore missing priors during training and inference. This yields a single versatile network that can adapt to different deployment scenarios, from fully calibrated lab environments to in-the-wild monocular videos without calibration. The model was trained on 3D keypoints from the AMASS dataset, with corresponding synthetic 2D data generated by sampling random camera poses and intrinsics. It was then compared to an expert model trained only on complete priors, and validated through an ablation study. Results show that both camera and segment-length priors improve performance, and that the versatile model outperforms the expert even when all priors are available while maintaining high accuracy when priors are missing. Overall, the average error in estimated 3D joint-center positions was as low as 36 mm, improving on the state of the art by half a centimeter at a much lower computational cost: the proposed model runs in 380 μs on GPU and 1800 μs on CPU, making it suitable for deployment on embedded platforms and low-power devices.
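The training-data recipe described in the abstract, projecting 3D keypoints through randomly sampled cameras to obtain paired 2D inputs, can be sketched as follows. This is a minimal pinhole-projection example; the intrinsic and pose sampling ranges are hypothetical, not the paper's actual sampling distribution.

```python
import numpy as np

def random_camera(rng):
    """Sample an illustrative random pinhole camera: intrinsics K,
    rotation R, and translation t (sampling ranges are assumptions)."""
    fx = fy = rng.uniform(500, 1500)
    cx, cy = rng.uniform(300, 700, size=2)
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])
    # Random rotation via QR decomposition of a Gaussian matrix
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1  # keep a proper rotation (det = +1)
    t = np.array([0.0, 0.0, rng.uniform(3.0, 6.0)])  # keep the subject in front of the camera
    return K, Q, t

def project(joints_3d, K, R, t):
    """Pinhole projection of (J, 3) world-frame joints to (J, 2) pixel coordinates."""
    cam = joints_3d @ R.T + t          # world -> camera frame
    uv = cam @ K.T                      # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]       # perspective divide

```

Repeating this for many random cameras turns each AMASS 3D sequence into many synthetic 2D/3D training pairs, with the sampled intrinsics available as a prior that can be kept or masked.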
Problem

Research questions and friction points this paper is trying to address.

Estimates 3D human pose from monocular 2D joints in real-time
Handles both calibrated and uncalibrated camera settings flexibly
Maintains accuracy when geometric priors are partially missing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based lifter for 3D pose estimation
Masking mechanism handles missing geometric priors
Lightweight design enables real-time embedded deployment