🤖 AI Summary
To address low 3D human localization accuracy and severe error accumulation under uncalibrated, non-rigid camera deployments, this paper proposes a low-cost binocular vision localization method that requires no camera calibration and imposes no strict hardware constraints. The core innovation is a “mean-of-means” probabilistic modeling framework: leveraging the Central Limit Theorem, joint locations are modeled as stochastic samples distributed around the human geometric center—abandoning conventional SVD-based multi-stage pose estimation and rigid-perspective assumptions. By jointly learning the mean mapping between world coordinates and pixel coordinates, and integrating billion-scale Monte Carlo sampling, the method achieves robust 3D localization. Using only two off-the-shelf 640×480 webcams (USD 10 each), it attains 96% accuracy (error ≤ 0.3 m) and 99.8% accuracy (error ≤ 0.5 m), significantly outperforming existing label-free visual approaches.
📝 Abstract
Accurate human localization is crucial for various applications, especially in the Metaverse era. Existing high precision solutions rely on expensive, tag-dependent hardware, while vision-based methods offer a cheaper, tag-free alternative. However, current vision solutions based on stereo vision face limitations due to rigid perspective transformation principles and error propagation in multi-stage SVD solvers. These solutions also require multiple high-resolution cameras with strict setup constraints.To address these limitations, we propose a probabilistic approach that considers all points on the human body as observations generated by a distribution centered around the body's geometric center. This enables us to improve sampling significantly, increasing the number of samples for each point of interest from hundreds to billions. By modeling the relation between the means of the distributions of world coordinates and pixel coordinates, leveraging the Central Limit Theorem, we ensure normality and facilitate the learning process. Experimental results demonstrate human localization accuracy of 96% within a 0.3$m$ range and nearly 100% accuracy within a 0.5$m$ range, achieved at a low cost of only 10 USD using two web cameras with a resolution of 640$ imes$480 pixels.