🤖 AI Summary
Existing 3D human pose estimation methods for metaverse-like applications suffer from high hardware costs, reliance on active markers, complex camera calibration, and stringent multi-view geometric constraints. To address these limitations, this paper proposes a low-cost, calibration-free, and flexibly deployable binocular vision localization approach. It requires only two off-the-shelf 640×480 USB webcams (total cost ≈ $10) and eliminates rigid geometric assumptions and strict inter-camera synchronization. Our key innovation is the “Mean-of-Means” probabilistic modeling framework: keypoints are modeled as stochastic samples centered around the geometric centroid; the Central Limit Theorem ensures asymptotic normality, avoiding multi-stage error accumulation inherent in SVD-based triangulation. By learning a direct mapping between pixel-coordinate means and world-coordinate means, the method enables lightweight geometric reasoning. Experiments show 95% of localization errors ≤ 0.3 m and nearly 100% ≤ 0.5 m—significantly outperforming existing low-cost alternatives while offering high accuracy, ease of deployment, and strong generalization.
📝 Abstract
Accurate human localization is crucial for many applications, especially in the Metaverse era. Existing high-precision solutions rely on expensive, tag-dependent hardware, while vision-based methods offer a cheaper, tag-free alternative. However, current stereo-vision solutions are limited by rigid perspective-transformation assumptions and by error propagation through multi-stage SVD solvers, and they require multiple high-resolution cameras under strict setup constraints. To address these limitations, we propose a probabilistic approach that treats all points on the human body as observations drawn from a distribution centered at the body's geometric center. This dramatically improves sampling, increasing the number of samples per point of interest from hundreds to billions. By modeling the relation between the means of the world-coordinate and pixel-coordinate distributions, and leveraging the Central Limit Theorem to ensure normality, we simplify the learning process. Experimental results demonstrate localization accuracy of 95% within 0.3 m and nearly 100% within 0.5 m, achieved at a cost of only $10 using two 640×480 web cameras.
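The core idea can be sketched in a few lines: average many noisy keypoint detections per view into a centroid (by the Central Limit Theorem the centroid is approximately normal), then learn a direct mapping from the two cameras' pixel-space centroids to the world-coordinate mean. The sketch below is a minimal illustration on synthetic data; the camera models, noise levels, and the ordinary-least-squares learner are all assumptions for demonstration, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def centroid(keypoints):
    """Mean of detected 2D keypoints: the body's pixel-space center estimate."""
    return keypoints.mean(axis=0)

# Hypothetical synthetic setup: affine-plus-noise projections stand in for
# two uncalibrated webcams observing a person at world position (x, y).
N, K = 500, 17                      # frames, keypoints per frame
world = rng.uniform(0, 5, size=(N, 2))
A1, b1 = np.array([[80., 5.], [-3., 70.]]), np.array([320., 240.])
A2, b2 = np.array([[60., -8.], [4., 90.]]), np.array([300., 200.])

feats = []
for w in world:
    kp1 = w @ A1.T + b1 + rng.normal(0, 6, size=(K, 2))  # camera-1 keypoints
    kp2 = w @ A2.T + b2 + rng.normal(0, 6, size=(K, 2))  # camera-2 keypoints
    # CLT: the centroid of many noisy keypoints is approximately normal,
    # with variance shrunk by a factor of K relative to a single keypoint.
    feats.append(np.concatenate([centroid(kp1), centroid(kp2)]))
feats = np.asarray(feats)

# Learn a direct mapping from pixel-coordinate means to world coordinates
# (least squares with a bias term), avoiding multi-stage SVD triangulation.
X = np.hstack([feats, np.ones((N, 1))])
W, *_ = np.linalg.lstsq(X, world, rcond=None)

err = np.linalg.norm(X @ W - world, axis=1)
print(f"median localization error: {np.median(err):.3f} m")
```

Because the centroid averages K keypoints, its pixel noise shrinks by a factor of √K, which is why a learned mapping from centroids can reach sub-meter accuracy even with low-resolution cameras.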