RSPose: Ranking Based Losses for Human Pose Estimation

📅 2025-11-17
🤖 AI Summary
Heatmap-based human pose estimation faces three key challenges: (P1) the MSE loss penalizes all pixel deviations equally, which hinders sharp and correct localization of the joint peak; (P2) heatmaps are spatially and class-wise imbalanced; and (P3) the training loss is misaligned with the mAP evaluation metric. To address these, we propose ranking-based losses that are, to the best of our knowledge, the first losses for human pose estimation aligned with mAP. The losses increase the correlation between keypoint confidence scores and localization quality, which leads to more accurate instance selection during NMS. The method applies to existing models such as ViTPose-H and SimCC. On COCO-val, it achieves 79.9 mAP with ViTPose-H, outperforming the previous state of the art, and improves SimCC ResNet-50 by 1.5 AP to 73.6 AP. Its effectiveness is also shown on CrowdPose and MPII. The core contribution is a set of evaluation-aligned losses for heatmap learning that improve instance selection during NMS by tying confidence estimates more closely to localization quality.

📝 Abstract
While heatmap-based human pose estimation methods have shown strong performance, they suffer from three main problems: (P1) the commonly used Mean Squared Error (MSE) loss may not always improve joint localization, because it penalizes all pixel deviations equally without focusing explicitly on sharpening and correctly localizing the peak corresponding to the joint; (P2) heatmaps are spatially and class-wise imbalanced; and (P3) there is a discrepancy between the evaluation metric (i.e., mAP) and the loss functions. We propose ranking-based losses to address these issues. Both theoretically and empirically, we show that our proposed losses are superior to commonly used heatmap losses (MSE, KL-Divergence). Our losses considerably increase the correlation between confidence scores and localization quality, which is desirable because higher correlation leads to more accurate instance selection during Non-Maximum Suppression (NMS) and better mean Average Precision (mAP). We refer to the models trained with our losses as RSPose. We show the effectiveness of RSPose in two different modes, one-dimensional and two-dimensional heatmaps, and on three different datasets (COCO, CrowdPose, MPII). To the best of our knowledge, we are the first to propose losses that align with the evaluation metric (mAP) for human pose estimation. RSPose outperforms the previous state of the art on the COCO-val set and achieves an mAP score of 79.9 with ViTPose-H, a vision transformer model for human pose estimation. We also improve SimCC ResNet-50, a coordinate classification-based pose estimation method, by 1.5 AP on the COCO-val set, achieving 73.6 AP.
Problem

Research questions and friction points this paper is trying to address.

Addressing heatmap spatial and class imbalance issues
Aligning loss functions with evaluation metric mAP
Improving joint localization accuracy through ranking losses
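
The spatial imbalance and the equal-weighting issue listed above can be illustrated with a small NumPy sketch. It is not taken from the paper: the heatmap size (64x48), the Gaussian width (sigma = 2), the 0.5 foreground threshold, and the helper name `gaussian_heatmap` are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma):
    """Render a 2D Gaussian target heatmap with peak value 1 at (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# Hypothetical settings (not from the paper): 64x48 heatmap, sigma = 2.
target = gaussian_heatmap(64, 48, cx=24, cy=32, sigma=2.0)

# Fraction of pixels that belong to the joint peak (value above 0.5).
foreground_fraction = float((target > 0.5).mean())

# An all-zero prediction misses the joint entirely, yet its MSE stays small
# because almost every pixel is background and all pixels weigh the same.
mse_all_zero = float(np.mean((np.zeros_like(target) - target) ** 2))

print(f"foreground fraction: {foreground_fraction:.4f}")
print(f"MSE of an all-zero prediction: {mse_all_zero:.4f}")
```

Under these assumed settings only a small fraction of pixels carries the peak, so a per-pixel average such as MSE is dominated by background pixels.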
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes ranking-based losses for pose estimation
Aligns loss functions with mAP evaluation metric
Improves correlation between confidence and localization quality
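
One way to make "correlation between confidence and localization quality" concrete is to score each predicted pose with COCO-style Object Keypoint Similarity (OKS) and compute a rank correlation against the confidence scores. The sketch below is an illustration under stated assumptions, not the paper's evaluation code: the data are synthetic, the per-keypoint constants `k` and the object area are placeholder values, and `oks` and `spearman` are hypothetical helper names.

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """COCO-style Object Keypoint Similarity for one pose.

    pred, gt: (J, 2) keypoint coordinates; visible: (J,) boolean mask;
    area: object scale squared (s^2); k: (J,) per-keypoint constants.
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)
    sim = np.exp(-d2 / (2.0 * area * k ** 2))
    return float(sim[visible].mean())

def spearman(a, b):
    """Spearman rank correlation; assumes there are no tied values."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Synthetic data (assumption): one ground-truth pose, 20 noisy predictions.
rng = np.random.default_rng(0)
num_joints = 17
gt = rng.uniform(0.0, 100.0, size=(num_joints, 2))
visible = np.ones(num_joints, dtype=bool)
area = 100.0 * 100.0                # placeholder object area
k = np.full(num_joints, 0.05)       # placeholder constants, not COCO's values

noise_levels = np.linspace(0.5, 10.0, 20)
preds = [gt + s * rng.normal(size=gt.shape) for s in noise_levels]
qualities = np.array([oks(p, gt, visible, area, k) for p in preds])

# Two hypothetical confidence heads: one ordered like OKS, one unrelated.
aligned_conf = qualities ** 2
random_conf = rng.uniform(size=len(qualities))

print("aligned head:", spearman(qualities, aligned_conf))
print("random head: ", spearman(qualities, random_conf))
```

A confidence head whose ordering matches the OKS ordering gives a rank correlation near 1, which is the situation in which NMS, keeping the highest-scoring instance, also keeps the best-localized one.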