RAPTR: Radar-based 3D Pose Estimation using Transformer

📅 2025-11-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the heavy reliance of radar-based indoor 3D human pose estimation on costly, labor-intensive fine-grained 3D keypoint annotations, this paper proposes a weakly supervised learning framework that requires only easily obtainable 3D bounding boxes and 2D keypoint labels. Methodologically, the authors design a two-stage pose decoder that uses pseudo-3D deformable attention to fuse multi-view radar features, and introduce a 3D template loss and a 3D gravity loss to mitigate depth ambiguity. Evaluated on the HIBER and MMVR datasets, the method reduces joint position error by 34.3% and 76.9%, respectively, significantly outperforming existing approaches. To the authors' knowledge, this is the first work to systematically tackle weakly supervised 3D pose estimation in the radar modality, substantially lowering annotation overhead and advancing practical deployment.
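As a rough illustration of the two-stage decoding flow described in the summary, here is a minimal PyTorch-style sketch; the module names, query shapes, and joint count are our assumptions, not the released implementation (see the repository linked in the abstract for the real code).

```python
# Hypothetical sketch of the two-stage decoder: a pose decoder produces
# coarse 3D poses, and a joint decoder refines them per joint.
import torch
import torch.nn as nn

class TwoStagePoseDecoder(nn.Module):
    def __init__(self, dim=256, num_joints=14):
        super().__init__()
        self.pose_decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.joint_decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.pose_head = nn.Linear(dim, num_joints * 3)  # coarse 3D pose per query
        self.joint_head = nn.Linear(dim, 3)              # per-joint 3D refinement
        self.num_joints = num_joints

    def forward(self, pose_queries, radar_feats):
        # Stage 1: pose-level queries attend to flattened multi-view radar features.
        q = self.pose_decoder(pose_queries, radar_feats)          # (B, Nq, dim)
        init = self.pose_head(q).view(q.size(0), -1, self.num_joints, 3)

        # Stage 2: expand each pose query into per-joint queries and refine.
        jq = q.unsqueeze(2).expand(-1, -1, self.num_joints, -1).flatten(1, 2)
        jq = self.joint_decoder(jq, radar_feats)                  # (B, Nq*J, dim)
        return init, init + self.joint_head(jq).view_as(init)
```

Per the summary, the two stages see different labels: the pose decoder is trained with a 3D template loss built from 3D BBox labels, while the joint decoder is supervised with 2D keypoints and a 3D gravity loss. The sketch above shows only the forward path.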

📝 Abstract
Radar-based indoor 3D human pose estimation has typically relied on fine-grained 3D keypoint labels, which are costly to obtain, especially in complex indoor settings involving clutter, occlusions, or multiple people. In this paper, we propose RAPTR (RAdar Pose esTimation using tRansformer) under weak supervision, using only 3D BBox and 2D keypoint labels, which are considerably easier and more scalable to collect. RAPTR is characterized by a two-stage pose decoder architecture with pseudo-3D deformable attention to enhance (pose/joint) queries with multi-view radar features: a pose decoder estimates initial 3D poses with a 3D template loss designed to utilize the 3D BBox labels and mitigate depth ambiguities, and a joint decoder refines the initial poses with 2D keypoint labels and a 3D gravity loss. Evaluated on two indoor radar datasets, RAPTR outperforms existing methods, reducing joint position error by 34.3% on HIBER and 76.9% on MMVR. Our implementation is available at https://github.com/merlresearch/radar-pose-transformer.
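To make the weak supervision concrete, below is a hedged sketch of how the two geometric losses named above could look; the exact formulations (a skeleton template fitted inside the 3D BBox, and a floor-contact gravity prior) are our reading of the abstract, and the function names, shapes, and up-axis convention are illustrative assumptions.

```python
# Assumed shapes: pred_pose (B, J, 3), bbox_center/bbox_size (B, 3),
# template (J, 3) = canonical skeleton in a unit box centered at the origin.
import torch

def template_loss(pred_pose, bbox_center, bbox_size, template):
    """Fit the template into the labeled 3D BBox and penalize deviation,
    constraining depth without any 3D keypoint labels."""
    fitted = bbox_center.unsqueeze(1) + template * bbox_size.unsqueeze(1)
    return (pred_pose - fitted).abs().mean()

def gravity_loss(pred_pose, floor_z=0.0, up_axis=2):
    """Penalize poses whose lowest joint floats above or sinks below the floor."""
    lowest = pred_pose[..., up_axis].min(dim=-1).values  # (B,)
    return (lowest - floor_z).abs().mean()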
Problem

Research questions and friction points this paper is trying to address.

Developing radar-based 3D human pose estimation with weak supervision
Reducing dependency on costly 3D keypoint labels in complex environments
Addressing depth ambiguities via multi-view radar features and dedicated 3D losses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses weak supervision with 3D BBox and 2D keypoint labels
Employs a two-stage pose decoder with pseudo-3D deformable attention (sketched after this list)
Applies 3D template and gravity losses to resolve depth ambiguities
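The deformable-attention bullet can be unpacked as follows: each 3D reference point is projected onto the two radar views (hence "pseudo-3D"), and features are sampled at the projections. The sketch below is a minimal illustration under assumed view geometry (horizontal x-y and vertical x-z planes) and normalized coordinates; it is not the authors' implementation.

```python
# Hypothetical pseudo-3D sampling: project 3D reference points onto two
# radar feature planes, bilinearly sample each view, then fuse.
import torch
import torch.nn.functional as F

def pseudo3d_deformable_sample(feat_hor, feat_ver, ref_xyz):
    """feat_hor, feat_ver: (B, C, H, W) horizontal/vertical radar features;
    ref_xyz: (B, N, 3) reference points normalized to [-1, 1]."""
    ref_hor = ref_xyz[..., [0, 1]]  # (x, y) lands on the horizontal view
    ref_ver = ref_xyz[..., [0, 2]]  # (x, z) lands on the vertical view

    # grid_sample takes a (B, N, 1, 2) grid for per-point bilinear lookup.
    s_hor = F.grid_sample(feat_hor, ref_hor.unsqueeze(2), align_corners=True)
    s_ver = F.grid_sample(feat_ver, ref_ver.unsqueeze(2), align_corners=True)

    # Simple mean fusion of the two views; a learned weighting could replace it.
    return 0.5 * (s_hor + s_ver).squeeze(-1).transpose(1, 2)  # (B, N, C)
```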
👥 Authors
Sorachi Kato
Mitsubishi Electric Research Laboratories (MERL), USA
Ryoma Yataka
Mitsubishi Electric Corporation
computer vision · radar perception · geometric deep learning · machine learning
P. Wang
Mitsubishi Electric Research Laboratories (MERL), USA
Pedro Miraldo
Mitsubishi Electric Research Laboratories (MERL)
3D Computer Vision · Robot Vision · Active Vision
Takuya Fujihashi
Osaka University
Video Streaming · Wireless Networks · Acoustic Networks
P. Boufounos
Mitsubishi Electric Research Laboratories (MERL), USA