🤖 AI Summary
To address the heavy reliance of radar-based indoor 3D human pose estimation on costly, labor-intensive fine-grained 3D keypoint annotations, this paper proposes a weakly supervised learning framework that requires only easily obtainable 3D bounding boxes and 2D keypoint labels. Methodologically, it designs a two-stage pose decoder incorporating pseudo-3D deformable attention to fuse multi-view radar features, and introduces a 3D template loss and a 3D gravity loss to mitigate depth ambiguity. Evaluated on the HIBER and MMVR datasets, the method reduces joint position error by 34.3% and 76.9%, respectively, significantly outperforming existing approaches. To the best of the authors' knowledge, this is the first work to systematically tackle weakly supervised 3D pose estimation in the radar modality, substantially lowering annotation overhead and advancing practical deployment.
📝 Abstract
Radar-based indoor 3D human pose estimation typically relies on fine-grained 3D keypoint labels, which are costly to obtain, especially in complex indoor settings involving clutter, occlusions, or multiple people. In this paper, we propose **RAPTR** (RAdar Pose esTimation using tRansformer) under weak supervision, using only 3D BBox and 2D keypoint labels, which are considerably easier and more scalable to collect. RAPTR is characterized by a two-stage pose decoder architecture with pseudo-3D deformable attention to enhance (pose/joint) queries with multi-view radar features: a pose decoder estimates initial 3D poses with a 3D template loss designed to utilize the 3D BBox labels and mitigate depth ambiguities, and a joint decoder refines the initial poses with 2D keypoint labels and a 3D gravity loss. Evaluated on two indoor radar datasets, RAPTR outperforms existing methods, reducing joint position error by 34.3% on HIBER and 76.9% on MMVR. Our implementation is available at https://github.com/merlresearch/radar-pose-transformer.
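The core idea of pseudo-3D deformable attention, as described above, is to let each 3D query gather features from multiple radar views at sampled locations near its (projected) reference point. The following is a minimal NumPy sketch of that sampling pattern; the function name, the toy "projection" (simply dropping the depth coordinate), and the nearest-neighbor sampling are all illustrative assumptions, not RAPTR's actual implementation.

```python
import numpy as np

def pseudo_3d_deformable_attention(query_xyz, feat_maps, offsets, weights):
    """Illustrative sketch: for each view, sample the 2D feature map at the
    projection of a 3D reference point plus learned offsets, then return the
    attention-weighted sum of the sampled features.

    query_xyz : (3,) 3D reference point of one query (toy units = pixels)
    feat_maps : list of per-view feature maps, each (H, W, C)
    offsets   : (V, K, 2) learned 2D sampling offsets per view
    weights   : (V, K) attention weights per sampled location
    """
    sampled = []
    for view, fmap in enumerate(feat_maps):
        H, W, _ = fmap.shape
        # Toy "projection": drop depth; a real model would use view geometry.
        u, v = query_xyz[0], query_xyz[1]
        for k in range(offsets.shape[1]):
            du, dv = offsets[view, k]
            # Nearest-neighbor sample, clamped to the feature-map bounds
            # (a real implementation would use bilinear interpolation).
            x = int(np.clip(round(u + du), 0, W - 1))
            y = int(np.clip(round(v + dv), 0, H - 1))
            sampled.append(weights[view, k] * fmap[y, x])
    return np.sum(sampled, axis=0)  # fused (C,) feature for this query

# Tiny usage example: two views, 3 sampling points each, C = 4 channels.
rng = np.random.default_rng(0)
feat_maps = [rng.standard_normal((8, 8, 4)) for _ in range(2)]
offsets = rng.standard_normal((2, 3, 2))
weights = np.full((2, 3), 1.0 / 6.0)  # uniform attention over 6 samples
fused = pseudo_3d_deformable_attention(np.array([4.0, 4.0, 2.0]),
                                       feat_maps, offsets, weights)
```

In the two-stage design, a fused feature like `fused` would update each pose/joint query before the decoder regresses or refines the 3D keypoints.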