🤖 AI Summary
Multi-view 3D human pose estimation suffers from poor generalization, reliance on calibrated cameras, and fixed numbers of views. Method: We propose a novel framework that requires no real 3D annotations, operates without camera calibration, and supports arbitrary multi-view configurations in real-world scenes. Its core innovations include: (i) representing 2D keypoints as 3D rays; (ii) introducing a View Fusion Transformer that fuses geometric information along these rays across views; and (iii) integrating ray-based geometric modeling, self-supervised multi-view consistency constraints, and synthetic-data pretraining for zero-shot cross-configuration deployment. Results: On a new in-the-wild multi-view and multi-person benchmark, our method reduces MPJPE by 53% over traditional triangulation and improves over image-based Transformer baselines by over 60%, demonstrating superior robustness and scalability.
📄 Abstract
Estimating 3D human poses from 2D images remains challenging due to occlusions and projective ambiguity. Multi-view learning-based approaches mitigate these issues but often fail to generalize to real-world scenarios, as large-scale multi-view datasets with 3D ground truth are scarce and captured under constrained conditions. To overcome this limitation, recent methods rely on 2D pose estimation combined with 2D-to-3D pose lifting trained on synthetic data. Building on our previous MPL framework, we propose RUMPL, a transformer-based 3D pose lifter that introduces a 3D ray-based representation of 2D keypoints. This formulation makes the model independent of camera calibration and of the number of views, enabling universal deployment across arbitrary multi-view configurations without retraining or fine-tuning. A new View Fusion Transformer leverages learned fused-ray tokens to aggregate information along rays, further improving multi-view consistency. Extensive experiments demonstrate that RUMPL reduces MPJPE by up to 53% compared to triangulation and by over 60% compared to transformer-based image-representation baselines. Results on new benchmarks, including in-the-wild multi-view and multi-person datasets, confirm its robustness and scalability. The framework's source code is available at https://github.com/aghasemzadeh/OpenRUMPL.
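To make the ray-based representation concrete, the sketch below back-projects a 2D keypoint into a 3D viewing ray (origin plus unit direction) using standard pinhole geometry. This is only an illustration of the underlying geometry: the function name and interface are hypothetical, it assumes known intrinsics and extrinsics, and it does not reproduce how RUMPL builds its ray tokens or achieves calibration independence (see the paper and repository for the actual implementation).

```python
import numpy as np

def keypoint_to_ray(uv, K, R, t):
    """Back-project a 2D keypoint into a 3D ray (illustrative sketch).

    uv:   (2,) pixel coordinates of the detected keypoint.
    K:    (3, 3) camera intrinsics.
    R, t: world-to-camera rotation (3, 3) and translation (3,).
    Returns (origin, direction), both in world coordinates.
    """
    # Camera center in world coordinates: C = -R^T t
    origin = -R.T @ t
    # Viewing direction through the pixel, in camera coordinates
    d_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    # Rotate into world coordinates and normalize to a unit direction
    d_world = R.T @ d_cam
    return origin, d_world / np.linalg.norm(d_world)
```

Every 2D keypoint from every view can be encoded this way, so the lifter reasons over a set of rays rather than per-camera image coordinates, which is what decouples it from any particular camera configuration.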