🤖 AI Summary
Existing Reasoning-based Pose Estimation (RPE) benchmarks suffer from severe reproducibility and evaluation-quality issues: image indices misaligned with the original 3DPW dataset make ground-truth (GT) matching tedious and error-prone, while image redundancy, scene imbalance, overly simple pose distributions, and ambiguous textual descriptions further undermine evaluation reliability. This work is the first to systematically identify and analyze these flaws. We propose a visually calibrated annotation-refinement protocol that manually re-aligns benchmark images with their precise 3DPW GT annotations, and we release the refined GT as an open-source, high-fidelity resource. Evaluation uses standard metrics such as MPJPE and PA-MPJPE, drastically reducing manual matching effort while improving the consistency and reproducibility of human pose understanding evaluations for multimodal large models. Our approach establishes a new paradigm for constructing fair, reliable, and scene-balanced RPE benchmarks. A rough sketch of the image re-alignment idea follows below.
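For readers attempting to reproduce the re-alignment, the sketch below illustrates one way to pair benchmark images back to 3DPW frames automatically before visual verification. This is a hypothetical illustration, not the authors' released protocol: the directory layout is assumed, and the average-hash heuristic is a stand-in for whatever matching the authors performed. Hash collisions, which are likely given the image redundancy noted above, still require the kind of manual visual confirmation the paper describes.

```python
# Hypothetical sketch: pair benchmark images with 3DPW frames by perceptual hash.
# Paths and layout are assumptions; collisions must be resolved by eye.
import sys
from pathlib import Path
from PIL import Image

def average_hash(path, size=8):
    """Classic aHash: threshold an 8x8 grayscale thumbnail at its mean.
    Tolerates JPEG re-encoding, unlike a raw byte digest."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return tuple(p > mean for p in pixels)

def build_3dpw_index(root):
    """Map aHash -> frame path for every image under the 3DPW image root.
    If two frames share a hash, only the last survives; flag such cases."""
    return {average_hash(p): p for p in Path(root).rglob("*.jpg")}

def realign(bench_dir, index):
    """Pair each benchmark image with the 3DPW frame sharing its hash."""
    return {p.name: index.get(average_hash(p))
            for p in sorted(Path(bench_dir).glob("*.jpg"))}

if __name__ == "__main__":
    # usage: python realign.py /path/to/3DPW/imageFiles /path/to/benchmark/images
    index = build_3dpw_index(sys.argv[1])
    for name, frame in realign(sys.argv[2], index).items():
        print(name, "->", frame if frame else "UNMATCHED (verify visually)")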
📝 Abstract
The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware multimodal large language models (MLLMs). Despite its significance, we identified critical reproducibility and benchmark-quality issues that hinder fair and consistent quantitative evaluation. Most notably, the benchmark uses image indices that differ from those of the original 3DPW dataset, forcing researchers into tedious and error-prone manual matching to obtain the accurate ground-truth (GT) annotations required for quantitative metrics (e.g., MPJPE, PA-MPJPE). Furthermore, our analysis reveals several inherent benchmark-quality limitations, including significant image redundancy, scenario imbalance, overly simplistic poses, and ambiguous textual descriptions, which collectively undermine reliable evaluation across diverse scenarios. To reduce manual effort and enhance reproducibility, we carefully refined the GT annotations through meticulous visual matching and publicly release them as an open-source resource, thereby promoting consistent quantitative evaluation and facilitating future advances in human pose-aware multimodal reasoning.
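For reference, the two metrics named above are the standard errors in 3D human pose estimation. A minimal NumPy sketch of both is given below; it assumes joint arrays of shape (J, 3) in consistent units (typically millimetres), and the Procrustes step follows the usual Umeyama/Kabsch similarity alignment rather than any implementation specific to this paper.

```python
# Minimal sketch of MPJPE and PA-MPJPE for (J, 3) joint arrays.
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average L2 distance per joint."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def procrustes_align(pred, gt):
    """Align pred to gt with a similarity transform (scale, rotation,
    translation) via SVD, correcting for reflections."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g           # center both point sets
    U, S, Vt = np.linalg.svd(g.T @ p)       # 3x3 cross-covariance
    R = U @ Vt
    if np.linalg.det(R) < 0:                # flip to keep a proper rotation
        Vt[-1] *= -1
        S[-1] *= -1
        R = U @ Vt
    scale = S.sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """Procrustes-Aligned MPJPE: MPJPE after rigid alignment to GT."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

MPJPE penalizes global pose error (usually after root alignment), while PA-MPJPE removes scale, rotation, and translation first, isolating articulated pose accuracy; both depend on GT joints being matched to the correct image, which is exactly what the misaligned indices break.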